System and apparatus for group floating-point inflate and deflate operations

ABSTRACT

Systems and apparatuses are presented relating to a programmable processor comprising an execution unit that is operable to decode and execute instructions received from an instruction path and partition data stored in registers in the register file into multiple data elements, the execution unit capable of executing group data handling operations that re-arrange data elements in different ways in response to data handling instructions, the execution unit further capable of executing a plurality of different group floating-point and group integer arithmetic operations that each arithmetically operates on the multiple data elements stored in registers in the register file to produce a catenated result that is returned to a register in the register file, wherein the catenated result comprises a plurality of individual results.

CROSS-REFERENCES TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 13/493,738, filed Jun. 11, 2012, which is a continuation of U.S. patent application Ser. No. 13/310,508, filed Dec. 2, 2011, which is a continuation of U.S. patent application Ser. No. 11/878,803, filed Jul. 27, 2007, now U.S. Pat. No. 8,117,426, which is a continuation of U.S. patent application Ser. No. 10/436,340, filed May 13, 2003, now U.S. Pat. No. 7,516,308, which is a continuation of U.S. patent application Ser. No. 09/534,745, filed Mar. 24, 2000, now U.S. Pat. No. 6,643,765, which is a continuation of U.S. patent application Ser. No. 09/382,402, filed Aug. 24, 1999, now U.S. Pat. No. 6,295,599, and which is a continuation-in-part of U.S. patent application Ser. No. 09/169,963, filed Oct. 13, 1998, now U.S. Pat. No. 6,006,318, which is a continuation of U.S. patent application Ser. No. 08/754,827, filed Nov. 22, 1996, now U.S. Pat. No. 5,822,603, which is a division of U.S. patent application Ser. No. 08/516,036, filed Aug. 16, 1995, now U.S. Pat. No. 5,742,840.

This application is a continuation of U.S. patent application Ser. No. 11/878,803, filed Jul. 27, 2007, which is a continuation of U.S. patent application Ser. No. 11/511,466, filed Aug. 29, 2006, now abandoned, which is a continuation of U.S. patent application Ser. No. 10/646,787, filed Aug. 25, 2003, now U.S. Pat. No. 7,216,217, which is a continuation of U.S. patent application Ser. No. 09/922,319, filed Aug. 2, 2001, which is a continuation of U.S. patent application Ser. No. 09/382,402, filed Aug. 24, 1999, now U.S. Pat. No. 6,295,599, which claims the benefit of priority to Provisional Application No. 60/097,635, filed Aug. 24, 1998, and is a continuation-in-part of U.S. patent application Ser. No. 09/169,963, filed Oct. 13, 1998, now U.S. Pat. No. 6,006,318, which is a continuation of U.S. patent application Ser. No. 08/754,827, filed Nov. 22, 1996, now U.S. Pat. No. 5,822,603, which is a divisional of U.S. patent application Ser. No. 08/516,036, filed Aug. 16, 1995, now U.S. Pat. No. 5,742,840.

The contents of all the U.S. patent applications and provisional applications listed above are hereby incorporated by reference, including their appendices, in their entirety.

FIELD OF THE INVENTION

The present invention relates to general purpose processor architectures, and particularly relates to general purpose processor architectures capable of executing group operations.

BACKGROUND OF THE INVENTION

The performance level of a processor, and particularly a general purpose processor, can be estimated from the multiple of a plurality of interdependent factors: clock rate, gates per clock, number of operands, operand and data path width, and operand and data path partitioning. Clock rate is largely influenced by the choice of circuit and logic technology, but is also influenced by the number of gates per clock. Gates per clock is how many gates in a pipeline may change state in a single clock cycle. This can be reduced by inserting latches into the data path: when the number of gates between latches is reduced, a higher clock is possible. However, the additional latches produce a longer pipeline length, and thus come at a cost of increased instruction latency. The number of operands is straightforward; for example, by adding with carry-save techniques, three values may be added together with little more delay than is required for adding two values. Operand and data path width defines how much data can be processed at once; wider data paths can perform more complex functions, but generally this comes at a higher implementation cost. Operand and data path partitioning refers to the efficient use of the data path as width is increased, with the objective of maintaining substantially peak usage.
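
By way of illustration only, the carry-save technique mentioned above can be sketched in software as follows. The function names, the 64-bit default width, and the use of Python are assumptions made for this sketch and do not describe any particular hardware embodiment. The three inputs are first reduced to a sum word and a carry word using only bitwise logic, with no carry propagation, and a single conventional addition then produces the final result.

```python
def carry_save_add(a: int, b: int, c: int, width: int = 64) -> tuple[int, int]:
    """Reduce three addends to a (sum, carry) pair without propagating carries."""
    mask = (1 << width) - 1
    # Bitwise sum of the three inputs, ignoring carries.
    partial_sum = (a ^ b ^ c) & mask
    # A carry is generated wherever at least two inputs have a 1 bit;
    # it is weighted one position higher, hence the left shift.
    carry = (((a & b) | (a & c) | (b & c)) << 1) & mask
    return partial_sum, carry


def add3(a: int, b: int, c: int, width: int = 64) -> int:
    """Add three values modulo 2**width using one carry-propagating addition."""
    partial_sum, carry = carry_save_add(a, b, c, width)
    return (partial_sum + carry) & ((1 << width) - 1)
```

Only the final addition in add3 requires carries to ripple across the word, which is why the three-input case costs little more delay than the two-input case.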

SUMMARY OF THE INVENTION

Embodiments of the invention pertain to systems and methods for enhancing the utilization of a general purpose processor by adding classes of instructions. These classes of instructions use the contents of general purpose registers as data path sources, partition the operands into symbols of a specified size, perform operations in parallel, catenate the results and place the catenated results into a general-purpose register. Some embodiments of the invention relate to a general purpose microprocessor which has been optimized for processing and transmitting media data streams through significant parallelism.
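
The partition, operate, and catenate sequence described above may be modeled in software as in the following minimal sketch. The helper names (partition, catenate, group_add), the little-endian symbol ordering, the 128-bit default register width, and the modular wrap-around of each symbol are assumptions chosen for illustration; the sketch processes symbols in a loop, whereas the described hardware operates on them in parallel.

```python
def partition(reg: int, symbol_bits: int, reg_bits: int = 128) -> list[int]:
    """Split a register value into symbols, least-significant symbol first."""
    mask = (1 << symbol_bits) - 1
    return [(reg >> pos) & mask for pos in range(0, reg_bits, symbol_bits)]


def catenate(symbols: list[int], symbol_bits: int) -> int:
    """Pack symbols back into one register value, least-significant first."""
    mask = (1 << symbol_bits) - 1
    result = 0
    for index, value in enumerate(symbols):
        result |= (value & mask) << (index * symbol_bits)
    return result


def group_add(ra: int, rb: int, symbol_bits: int, reg_bits: int = 128) -> int:
    """Add corresponding symbols independently; carries never cross symbols."""
    a_syms = partition(ra, symbol_bits, reg_bits)
    b_syms = partition(rb, symbol_bits, reg_bits)
    sums = [a + b for a, b in zip(a_syms, b_syms)]  # catenate() wraps each sum
    return catenate(sums, symbol_bits)
```

For example, group_add(0x00FF0001, 0x00010001, 16) treats the operands as 16-bit symbols and yields 0x01000002, whereas the same call with 8-bit symbols yields 0x00000002 because the byte 0xFF + 0x01 wraps to zero without carrying into its neighbor.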

Some embodiments of the present invention provide a system and method for improving the performance of general purpose processors by including the capability to execute group operations involving multiple floating-point operands. In one embodiment, a programmable media processor comprises a virtual memory addressing unit, a data path, a register file comprising a plurality of registers coupled to the data path, and an execution unit coupled to the data path capable of executing group floating-point operations in which multiple floating-point operands stored in partitioned fields of one or more of the plurality of registers are operated on to produce catenated results. The group floating-point operations may involve operating on at least two of the multiple floating-point operands in parallel. The catenated results may be returned to a register, and general purpose registers may be used as operand and result registers for the floating-point operations. In some embodiments the execution unit may also be capable of performing group floating-point operations on floating-point data of more than one precision. In some embodiments the group floating-point operations may include group add, group subtract, group compare, group multiply and group divide arithmetic operations that operate on catenated floating-point data. In some embodiments, the group floating-point operations may include group multiply-add, group scale-add, and group set operations that operate on catenated floating-point data.

In one embodiment, the execution unit is also capable of executing group integer instructions involving multiple integer operands stored in partitioned fields of registers. The group integer operations may involve operating on at least two of the multiple integer operands in parallel. The group integer operations may include group add, group subtract, group compare, and group multiply arithmetic operations that operate on catenated integer data.

In one embodiment, the execution unit is capable of performing group data handling operations, including operations that copy, operations that shift, operations that rearrange and operations that resize catenated integer data stored in a register and return catenated results. The execution unit may also be configurable to perform group data handling operations on integer data having a symbol width of 8 bits, group data handling operations on integer data having a symbol width of 16 bits, and group data handling operations on integer data having a symbol width of 32 bits. In one embodiment, the operations are controlled by values in a register operand. In one embodiment, the operations are controlled by values in the instruction.

In one embodiment, the multi-precision execution unit is capable of executing a Galois field instruction operation.

In one embodiment, the multi-precision execution unit is configurable to execute a plurality of instruction streams in parallel from a plurality of threads, and the programmable media processor further comprises a register file associated with each thread executing in parallel on the multi-precision execution unit to support processing of the plurality of threads. In some embodiments, the multi-precision execution unit executes instructions from the plurality of instruction streams in a round-robin manner. In some embodiments, the processor ensures only one thread from the plurality of threads can handle an exception at any given time.

Some embodiments of the present invention provide a multiplier array that is fully used for high precision arithmetic, but is only partly used for other, lower precision operations. This can be accomplished by extracting the high-order portion of the multiplier product or sum of products, adjusted by a dynamic shift amount from a general register or an adjustment specified as part of the instruction, and rounded by a control value from a register or instruction portion. The rounding may be any of several types, including round-to-nearest/even, toward zero, floor, or ceiling. Overflows are typically handled by limiting the result to the largest and smallest values that can be accurately represented in the output result.
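
The shift, round, and limit steps described above can be sketched as follows. This is an illustrative software model only: the function name extract, the string names of the rounding modes, and the assumption that the result is a signed two's complement value of result_bits bits are choices made for this sketch and are not the instruction encodings of any embodiment.

```python
def extract(product: int, shift: int, result_bits: int,
            rounding: str = "nearest") -> int:
    """Shift a signed product right, round it, and limit it to result_bits."""
    floor_q = product >> shift                  # arithmetic shift rounds toward -infinity
    remainder = product - (floor_q << shift)    # discarded bits, 0 <= remainder < 2**shift
    if rounding == "floor":
        quotient = floor_q
    elif rounding == "ceiling":
        quotient = floor_q + (1 if remainder else 0)
    elif rounding == "zero":
        # Truncation equals floor for non-negative values and ceiling for negative ones.
        quotient = floor_q + (1 if remainder and product < 0 else 0)
    elif rounding == "nearest":
        # Round to nearest, with exact ties going to the even neighbor.
        twice, full = 2 * remainder, 1 << shift
        round_up = twice > full or (twice == full and floor_q & 1)
        quotient = floor_q + (1 if round_up else 0)
    else:
        raise ValueError(f"unknown rounding mode: {rounding}")
    # Limit overflows to the largest and smallest representable results.
    largest = (1 << (result_bits - 1)) - 1
    smallest = -(1 << (result_bits - 1))
    return max(smallest, min(largest, quotient))
```

In this model, a product of 1000 shifted right by one bit would be 500, which does not fit in a signed 8-bit result and is therefore limited to 127 rather than being allowed to wrap around.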

When an extract is controlled by a register, the size of the result can be specified, allowing rounding and limiting to a smaller number of bits than can fit in the result. This permits the result to be scaled for use in subsequent operations without concern of overflow or rounding. As a result, performance is enhanced. In those instances where the extract is controlled by a register, a single register value defines the size of the operands, the shift amount and size of the result, and the rounding control. By placing such control information in a single register, the size of the instruction is reduced over the number of bits that such an instruction would otherwise require, again improving performance and enhancing processor flexibility. Exemplary instructions are Ensemble Convolve Extract, Ensemble Multiply Extract, Ensemble Multiply Add Extract, and Ensemble Scale Add Extract. With particular regard to the Ensemble Scale Add Extract Instruction, the extract control information is combined in a register with two values used as scalar multipliers to the contents of two vector multiplicands. This combination reduces the number of registers otherwise required, thus reducing the number of bits required for the instruction.

In one embodiment, the processor performs load and store instructions operable to move values between registers and memory. In one embodiment, the processor performs both instructions that verify alignment of memory operands and instructions that permit memory operands to be unaligned. In one embodiment, the processor performs store multiplex instructions operable to move to memory a portion of data contents controlled by a corresponding mask contents. In one embodiment, this masked storage operation is performed by indivisibly reading-modifying-writing a memory operand.
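
The masked storage operation described above may be illustrated by the following sketch, in which memory is modeled as a Python bytearray. The function name store_multiplex, the little-endian byte order, and the 16-byte default operand size are assumptions made for illustration. The sketch also does not model indivisibility; in the described embodiment the read-modify-write is performed as a single indivisible operation.

```python
def store_multiplex(memory: bytearray, address: int, data: int, mask: int,
                    size: int = 16) -> None:
    """Write only the bits of data selected by mask; other bits keep their old value."""
    width_mask = (1 << (8 * size)) - 1
    # Read the existing memory operand (byte order is an assumption of this sketch).
    old = int.from_bytes(memory[address:address + size], "little")
    # Modify: take masked bits from data and unmasked bits from memory.
    new = ((data & mask) | (old & ~mask)) & width_mask
    # Write the merged operand back.
    memory[address:address + size] = new.to_bytes(size, "little")
```

For example, storing the data value 0x12345678 under the mask 0x00FF00FF into a 4-byte operand that initially holds 0xAAAAAAAA leaves 0xAA34AA78 in memory.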

In one embodiment, all processor, memory and interface resources are directly accessible to high-level language programs. In one embodiment, assembler codes and high-level language formats are specified to access enhanced instructions. In one embodiment, interface and system state is memory mapped, so that it can be manipulated by compiled code. In one embodiment, software libraries provide other operations required by the ANSI/IEEE floating-point standard. In one embodiment, software conventions are employed at software module boundaries, in order to permit the combination of separately compiled code and to provide standard interfaces between application, library and system software. In one embodiment, instruction scheduling is performed by a compiler.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a system level diagram showing the functional blocks of a system according to the present invention.

FIG. 2 is a matrix representation of a wide matrix multiply in accordance with one embodiment of the present invention.

FIG. 3 is a further representation of a wide matrix multiply in accordance with one embodiment of the present invention.

FIG. 4 is a system level diagram showing the functional blocks of a system incorporating a combined Simultaneous Multi Threading and Decoupled Access from Execution processor in accordance with one embodiment of the present invention.

FIG. 5 illustrates a wide operand in accordance with one embodiment of the present invention.

FIG. 6 illustrates an approach to specifier decoding in accordance with one embodiment of the present invention.

FIG. 7 illustrates in operational block form a Wide Function Unit in accordance with one embodiment of the present invention.

FIG. 8 illustrates in flow diagram form the Wide Microcache control function.

FIG. 9 illustrates Wide Microcache data structures.

FIGS. 10 and 11 illustrate a Wide Microcache control.

FIG. 12 is a timing diagram of a decoupled pipeline structure in accordance with one embodiment of the present invention.

FIG. 13 further illustrates the pipeline organization of FIG. 12.

FIG. 14 is a diagram illustrating the basic organization of the memory management system according to the present embodiment of the invention.

FIG. 15 illustrates the physical address of an LTB entry for thread th, entry en, byte b.

FIG. 16 illustrates a definition for AccessPhysicalLTB.

FIG. 17 illustrates how various 16-bit values are packed together into a 64-bit LTB entry.

FIG. 18 illustrates global access as fields of a control register.

FIG. 19 shows how a single-set LTB context may be further simplified by reserving the implementation of the lm and la registers.

FIG. 20 shows the partitioning of the virtual address space if the largest possible space is reserved for an address space identifier.

FIG. 21 shows how the LTB protect field controls the minimum privilege level required for each memory action of read (r), write (w), execute (x), and gateway (g), as well as memory and cache attributes of write allocate (wa), detail access (da), strong ordering (so), cache disable (cd), and write through (wt).

FIG. 22 illustrates a definition for LocalTranslation.

FIG. 23 shows how the low-order GT bits of the th value are ignored, reflecting that 2^(GT) threads share a single GTB.

FIG. 24 illustrates a definition for AccessPhysicalGTB.

FIG. 25 illustrates the format of a GTB entry.

FIG. 26 illustrates a definition for GlobalAddressTranslation.

FIG. 27 illustrates a definition for GTBUpdateWrite.

FIG. 28 shows how the low-order GT bits of the th value are ignored, reflecting that 2^(GT) threads share single GTB registers.

FIG. 29 illustrates the registers GTBLast, GTBFirst, and GTBBump.

FIG. 30 illustrates a definition for AccessPhysicalGTBRegisters.

FIGS. 31A-31C illustrate Group Boolean instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 31D-31E illustrate Group Multiplex instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 32A-32C illustrate Group Add instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 33A-33C illustrate Group Subtract and Group Set instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 34A-34C illustrate Ensemble Divide and Ensemble Multiply instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 35A-35C illustrate Group Compare instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 36A-36C illustrate Ensemble Unary instructions in accordance with an exemplary embodiment of the present invention.

FIG. 37 illustrates exemplary functions that are defined for use within the detailed instruction definitions in other sections.

FIGS. 38A-38C illustrate Ensemble Floating-Point Add, Ensemble Floating-Point Divide, and Ensemble Floating-Point Multiply instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 38D-38F illustrate Ensemble Floating-Point Multiply Add instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 38G-38I illustrate Ensemble Floating-Point Scale Add instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 39A-39C illustrate Ensemble Floating-Point Subtract instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 39D-39G illustrate Group Set Floating-point instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 40A-40C illustrate Group Compare Floating-point instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 41A-41C illustrate Ensemble Unary Floating-point instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 42A-42D illustrate Ensemble Multiply Galois Field instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 43A-43D illustrate Compress, Expand, Rotate, and Shift instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 43E-43G illustrate Shift Merge instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 43H-43J illustrate Compress Immediate, Expand Immediate, Rotate Immediate, and Shift Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 43K-43M illustrate Shift Merge Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 44A-44D illustrate Crossbar Extract instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 44E-44K illustrate Ensemble Extract instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 45A-45F illustrate Deposit and Withdraw instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 45G-45J illustrate Deposit Merge instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 46A-46E illustrate Shuffle instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 47A-47C illustrate Swizzle instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 47D-47E illustrate Select instructions in accordance with an exemplary embodiment of the present invention.

FIG. 48 is a pin summary describing the functions of various pins in accordance with one embodiment of the present invention.

FIGS. 49A-49G present electrical specifications describing AC and DC parameters in accordance with one embodiment of the present invention.

FIGS. 50A-50C illustrate Load instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 51A-51C illustrate Load Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 52A-52C illustrate Store and Store Multiplex instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 53A-53C illustrate Store Immediate and Store Multiplex Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 54A-54E illustrate Data-Handling Operations in accordance with an exemplary embodiment of the present invention.

FIG. 54F illustrates Procedure Calling Conventions in accordance with an exemplary embodiment of the present invention.

FIG. 54G illustrates alignment within the dp region in accordance with an exemplary embodiment of the present invention.

FIG. 54H illustrates gateway with pointers to code and data spaces in accordance with an exemplary embodiment of the present invention.

FIGS. 55-56 illustrate an expected rate at which memory requests are serviced in accordance with an exemplary embodiment of the present invention.

FIG. 57 is a pinout diagram in accordance with an exemplary embodiment of the present invention.

FIGS. 58A-58C illustrate Always Reserved instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 59A-59C illustrate Address instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 60A-60C illustrate Address Compare instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 61A-61C illustrate Address Copy Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 62A-62C illustrate Address Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 63A-63C illustrate Address Immediate Reversed instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 64A-64C illustrate Address Reversed instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 65A-65C illustrate Address Shift Left Immediate Add instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 66A-66C illustrate Address Shift Left Immediate Subtract instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 67A-67C illustrate Address Shift Left Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 68A-68C illustrate Address Ternary instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 69A-69C illustrate Branch instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 70A-70C illustrate Branch Back instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 71A-71C illustrate Branch Barrier instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 72A-72C illustrate Branch Conditional instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 73A-73C illustrate Branch Conditional Floating-Point instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 74A-74C illustrate Branch Conditional Visibility Floating-Point instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 75A-75C illustrate Branch Down instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 76A-76C illustrate Branch Gateway instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 77A-77C illustrate Branch Halt instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 78A-78C illustrate Branch Hint instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 79A-79C illustrate Branch Hint Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 80A-80C illustrate Branch Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 81A-81C illustrate Branch Immediate Link instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 82A-82C illustrate Branch Link instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 83A-83C illustrate Store Double Compare Swap instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 84A-84C illustrate Store Immediate Inplace instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 85A-85C illustrate Store Inplace instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 86A-86C illustrate Group Add Halve instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 87A-87C illustrate Group Copy Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 88A-88C illustrate Group Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 89A-89C illustrate Group Immediate Reversed instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 90A-90C illustrate Group Inplace instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 91A-91C illustrate Group Shift Left Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 92A-92C illustrate Group Shift Left Immediate Subtract instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 93A-93C illustrate Group Subtract Halve instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 94A-94C illustrate Ensemble instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 95A-95E illustrate Ensemble Convolve Extract Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 96A-96E illustrate Ensemble Convolve Floating-Point instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 97A-97G illustrate Ensemble Extract Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 98A-98G illustrate Ensemble Extract Immediate Inplace instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 99A-99C illustrate Ensemble Inplace instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 100A-100E illustrate Wide Multiply Matrix instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 101A-101E illustrate Wide Multiply Matrix Extract instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 102A-102E illustrate Wide Multiply Matrix Extract Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 103A-103E illustrate Wide Multiply Matrix Floating-Point Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 104A-104D illustrate Wide Multiply Matrix Galois Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 105A-105C illustrate Wide Switch Immediate instructions in accordance with an exemplary embodiment of the present invention.

FIGS. 106A-106C illustrate Wide Translate instructions in accordance with an exemplary embodiment of the present invention.

FIG. 107 shows the timing of what has become a canonical pipeline structure for a simple RISC processor, with time on the horizontal axis increasing to the right, and successive instructions on the vertical axis going downward.

FIG. 108 shows simplified diagrams of FIG. 107, obtained by eliminating the pipe stages for instruction fetch, register file fetch, and register file write, which can be understood to precede and follow the portions of the pipelines diagrammed.

FIG. 109 shows a two-way superscalar processor, where one instruction may be a register-to-register operation (using stage E) and the other may be a register-to-register operation (using stage A) or a memory load or store.

FIG. 110 shows a two-cycle superpipelined implementation.

FIG. 111 shows a box for the interval between issue of each instruction and the completion.

FIG. 112 shows a refinement to the organization shown in FIG. 111, in which the time permitted by the pipeline to service load operations may be flexibly extended.

FIG. 113 shows that the thread number which issues an instruction is indicated on each clock cycle, and below it, a list of which functional units may be used by that instruction.

FIG. 114 shows that the resource use diagram looks similar to that of the collection when seen from the perspective of an individual thread.

FIG. 115 shows the address presented to LOC data bank bn.

FIG. 116 shows a table in which the bt field specifies the least-significant bit used for tag.

FIG. 117 shows the address presented to LOC data bank (ci_(1..0)∥si₁).

FIG. 118 shows the physical address of a LOC hexlet for LOC address ba, bank bn, byte b.

FIG. 119 shows the LOC address (pa_(17..7)) presented to LOC data bank (pa_(6..4)).

DETAILED DESCRIPTION OF THE INVENTION

Introduction

In various embodiments of the invention, a computer processor architecture, referred to here as MicroUnity's Zeus Architecture, is presented. MicroUnity's Zeus Architecture describes general-purpose processor, memory, and interface subsystems, organized to operate at the enormously high bandwidth rates required for broadband applications.

The Zeus processor performs integer, floating point, signal processing and non-linear operations such as Galois field, table lookup and bit switching on data sizes from 1 bit to 128 bits. Group or SIMD (single instruction multiple data) operations sustain external operand bandwidth rates up to 512 bits (i.e., up to four 128-bit operand groups) per instruction even on data items of small size. The processor performs ensemble operations such as convolution that maintain full intermediate precision with aggregate internal operand bandwidth rates up to 20,000 bits per instruction. The processor performs wide operations such as crossbar switch, matrix multiply and table lookup that use caches embedded in the execution units themselves to extend operands to as much as 32768 bits. All instructions produce at most a single 128-bit register result, source at most three 128-bit registers and are free of side effects such as the setting of condition codes and flags. The instruction set design carries the concept of streamlining beyond Reduced Instruction Set Computer (RISC) architectures, to simplify implementations that issue several instructions per machine cycle.

The Zeus memory subsystem provides 64-bit virtual and physical addressing for UNIX, Mach, and other advanced OS environments. Separate address instructions enable the division of the processor into decoupled access and execution units, to reduce the effective latency of memory to the pipeline. The Zeus cache supplies the high data and instruction issue rates of the processor, and supports coherency primitives for scaleable multiprocessors. The memory subsystem includes mechanisms for sustaining high data rates not only in block transfer modes, but also in non-unit stride and scattered access patterns.

The Zeus interface subsystem is designed to match industry-standard “Socket 7” protocols and pin-outs. In this way, Zeus can make use of the immense infrastructure of the PC for building low-cost systems. The interface subsystem is modular, and can be replaced with appropriate protocols and pin-outs for lower-cost and higher-performance systems.

The goal of the Zeus architecture is to integrate these processor, memory, and interface capabilities with optimal simplicity and generality. From the software perspective, the entire machine state consists of a program counter, a single bank of 64 general-purpose 128-bit registers, and a linear byte-addressed shared memory space with mapped interface registers. All interrupts and exceptions are precise, and occur with low overhead.

Examples discussed herein are for Zeus software and hardware developers alike, and define the interface at which their designs must meet. Zeus pursues the most efficient tradeoffs between hardware and software complexity by making all processor, memory, and interface resources directly accessible to high-level language programs.

Conformance

To ensure that Zeus systems may freely interchange data, user-level programs, system-level programs and interface devices, the Zeus system architecture reaches above the processor level architecture.

Optional Areas

Optional areas include:

Number of processor threads

Size of first-level cache memories

Existence of a second-level cache

Size of second-level cache memory

Size of system-level memory

Existence of certain optional interface device interfaces

Upward-Compatible Modifications

Additional devices and interfaces not covered by this standard may be added in specified regions of the physical memory space, provided that system reset places these devices and interfaces in an inactive state that does not interfere with the operation of software that runs in any conformant system. The software interface requirements of any such additional devices and interfaces must be made as widely available as this architecture specification.

Unrestricted Physical Implementation

Nothing in this specification should be construed to limit the implementation choices of the conforming system beyond the specific requirements stated herein. In particular, a computer system may conform to the Zeus System Architecture while employing any number of components, dissipating any amount of heat, requiring any special environmental facilities, or being of any physical size.

Common Elements Notation

The descriptive notation used in this document is summarized in the table below:

Notation      Description
x + y         two's complement addition of x and y. Result is the same size as the operands, and operands must be of equal size.
x − y         two's complement subtraction of y from x. Result is the same size as the operands, and operands must be of equal size.
x * y         two's complement multiplication of x and y. Result is the same size as the operands, and operands must be of equal size.
x / y         two's complement division of x by y. Result is the same size as the operands, and operands must be of equal size.
x & y         bitwise and of x and y. Result is same size as the operands, and operands must be of equal size.
x | y         bitwise or of x and y. Result is same size as the operands, and operands must be of equal size.
x ^ y         bitwise exclusive-OR of x and y. Result is same size as the operands, and operands must be of equal size.
~x            bitwise inversion of x. Result is same size as the operand.
x = y         two's complement equality comparison between x and y. Result is a single bit, and operands must be of equal size.
x ≠ y         two's complement inequality comparison between x and y. Result is a single bit, and operands must be of equal size.
x < y         two's complement less than comparison between x and y. Result is a single bit, and operands must be of equal size.
x ≧ y         two's complement greater than or equal comparison between x and y. Result is a single bit, and operands must be of equal size.
√x            floating-point square root of x
x || y        concatenation of bit field x to left of bit field y
x^(y)         binary digit x repeated, concatenated y times. Size of result is y.
x_(y)         extraction of bit y (using little-endian bit numbering) from value x. Result is a single bit.
x_(y..z)      extraction of bit field formed from bits y through z of value x. Size of result is y − z + 1; if z > y, result is an empty string.
x?y:z         value of y, if x is true, otherwise value of z. Value of x is a single bit.
x ← y         bitwise assignment of x to value of y
Sn            signed, two's complement, binary data format of n bytes
Un            unsigned binary data format of n bytes
Fn            floating-point data format of n bytes
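
As an illustration only, the bit-field entries of the notation can be mimicked on Python integers. The helper names `bit`, `field`, `repeat` and `cat` are hypothetical and are not part of the Zeus specification; widths must be passed explicitly because Python integers carry no size, and the empty string is modelled as zero.

```python
def bit(x, y):
    """x_(y): extract bit y, where bit 0 is the least-significant bit."""
    return (x >> y) & 1

def field(x, y, z):
    """x_(y..z): extract bits y down to z; the result is y - z + 1 bits wide."""
    if z > y:
        return 0  # the notation yields an empty string; modelled here as zero
    return (x >> z) & ((1 << (y - z + 1)) - 1)

def repeat(b, y):
    """x^(y): the single binary digit b repeated y times."""
    return ((1 << y) - 1) if b else 0

def cat(x, y, y_width):
    """x || y: place bit field x to the left of y, where y is y_width bits wide."""
    return (x << y_width) | y

value = 0b10110110
print(bit(value, 1), bin(field(value, 7, 4)), bin(repeat(1, 4)))
```

The same helpers are enough to express sign extension, for example `cat(repeat(bit(v, 63), 64), field(v, 63, 0), 64)`.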

Bit Ordering

The ordering of bits in this document is always little-endian, regardless of the ordering of bytes within larger data structures. Thus, the least-significant bit of a data structure is always labeled 0 (zero), and the most-significant bit is labeled as the data structure size (in bits) minus one.

Memory

Zeus memory is an array of 2⁶⁴ bytes, without a specified byte ordering, which is physically distributed among various components.

Byte

A byte is a single element of the memory array, consisting of 8 bits:

Byte Ordering

Larger data structures are constructed from the concatenation of bytes in either little-endian or big-endian byte ordering. A memory access of a data structure of size s at address i is formed from memory bytes at addresses i through i+s−1. Unless otherwise specified, there is no specific requirement of alignment: it is not generally required that i be a multiple of s. Aligned accesses are preferred whenever possible, however, as they will often require one fewer processor or memory clock cycle than unaligned accesses.

With little-endian byte ordering, the bytes are arranged as:

With big-endian byte ordering, the bytes are arranged as:

Zeus memory is byte-addressed, using either little-endian or big-endian byte ordering. For consistency with the bit ordering, and for compatibility with x86 processors, Zeus uses little-endian byte ordering when an ordering must be selected. Zeus load and store instructions are available for both little-endian and big-endian byte ordering. The selection of byte ordering is dynamic, so that little-endian and big-endian processes, and even data structures within a process, can be intermixed on the processor.
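
A minimal sketch of the two byte orderings, assuming memory is modelled as a Python `bytes` object; the function name `load` is hypothetical. It forms a value of size s from the bytes at addresses i through i+s−1 and deliberately uses an unaligned address, since alignment is not generally required.

```python
def load(memory, i, s, byteorder):
    """Form an s-byte value from memory[i] .. memory[i + s - 1]."""
    return int.from_bytes(memory[i:i + s], byteorder)

memory = bytes([0x11, 0x22, 0x33, 0x44, 0x55])

# The same four bytes at address 1 read under each ordering.
little = load(memory, 1, 4, "little")
big = load(memory, 1, 4, "big")
print(hex(little), hex(big))
```

Because the ordering is an argument of each access rather than a global mode, little-endian and big-endian data can be intermixed in the same memory image, which mirrors the dynamic selection described above.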

Memory Read/Load Semantics

Zeus memory, including memory-mapped registers, must conform to the following requirements regarding side-effects of read or load operations:

A memory read must have no side-effects on the contents of the addressed memory nor on the contents of any other memory.

Memory Write/Store Semantics

Zeus memory, including memory-mapped registers, must conform to the following requirements regarding side-effects of write or store operations:

A memory write must affect the contents of the addressed memory so that a memory read of the addressed memory returns the value written, and so that a memory read of a portion of the addressed memory returns the appropriate portion of the value written.

A memory write may affect or cause side-effects on the contents of memory not addressed by the write operation; however, a second memory write of the same value to the same address must have no side-effects on any memory. That is, memory write operations must be idempotent.

Zeus store instructions that are weakly ordered may have side-effects on the contents of memory not addressed by the store itself; subsequent load instructions which are also weakly ordered may or may not return values which reflect the side-effects.

Data

Zeus provides eight-byte (64-bit) virtual and physical address sizes, and eight-byte (64-bit) and sixteen-byte (128-bit) data path sizes, and uses fixed-length four-byte (32-bit) instructions. Arithmetic is performed on two's-complement or unsigned binary and ANSI/IEEE standard 754-1985 conforming binary floating-point number representations.

Fixed-Point Data

Bit

A bit is a primitive data element:

Peck

A peck is the catenation of two bits:

Nibble

A nibble is the catenation of four bits:

Byte

A byte is the catenation of eight bits, and is a single element of the memory array:

Doublet

A doublet is the catenation of 16 bits, and is the catenation of two bytes:

Quadlet

A quadlet is the catenation of 32 bits, and is the catenation of four bytes:

Octlet

An octlet is the catenation of 64 bits, and is the catenation of eight bytes:

Hexlet

A hexlet is the catenation of 128 bits, and is the catenation of sixteen bytes:

Triclet

A triclet is the catenation of 256 bits, and is the catenation of thirty-two bytes:

Address

Zeus addresses, both virtual addresses and physical addresses, are octlet quantities.

Floating-Point Data

Zeus's floating-point formats are designed to satisfy ANSI/IEEE standard 754-1985: Binary Floating-point Arithmetic. Standard 754 leaves certain aspects to the discretion of implementers: additional precision formats, encoding of quiet and signaling NaN values, details of production and propagation of quiet NaN values. These aspects are detailed below.

Zeus adds additional half-precision and quad-precision formats to standard 754's single-precision and double-precision formats. Zeus's double-precision satisfies standard 754's precision requirements for a single-extended format, and Zeus's quad-precision satisfies standard 754's precision requirements for a double-extended format.

Each precision format employs fields labeled s (sign), e (exponent), and f (fraction) to encode values that are (1) NaN: quiet and signaling, (2) infinities: (−1)^s ∞, (3) normalized numbers: (−1)^s 2^(e−bias) (1.f), (4) denormalized numbers: (−1)^s 2^(1−bias) (0.f), and (5) zero: (−1)^s 0.

Quiet NaN values are denoted by any sign bit value, an exponent field of all one bits, and a non-zero fraction with the most significant bit set. Quiet NaN values generated by default exception handling of standard operations have a zero sign bit, an exponent field of all one bits, a fraction field with the most significant bit set, and all other bits cleared.

Signaling NaN values are denoted by any sign bit value, an exponent field of all one bits, and a non-zero fraction with the most significant bit cleared.

Infinite values are denoted by any sign bit value, an exponent field of all one bits, and a zero fraction field.

Normalized number values are denoted by any sign bit value, an exponent field that is not all one bits or all zero bits, and any fraction field value. The numeric value encoded is (−1)^s 2^(e−bias) (1.f). The bias is equal to the value resulting from setting all but the most significant bit of the exponent field: 15 for half, 127 for single, 1023 for double, and 16383 for quad.

Denormalized number values are denoted by any sign bit value, an exponent field that is all zero bits, and a non-zero fraction field value. The numeric value encoded is (−1)^s 2^(1−bias) (0.f).

Zero values are denoted by any sign bit value, an exponent field that is all zero bits, and a fraction field that is all zero bits. The numeric value encoded is (−1)^s 0. The distinction between +0 and −0 is significant in some operations.
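
The encoding rules above can be sketched as a small decoder. This is an illustration under stated assumptions, not the Zeus definition: the field widths in `FORMATS` are assumed (they are chosen to be consistent with the biases 15, 127, 1023 and 16383 given above and with the usual IEEE 754 layouts), and the names `FORMATS`, `bias` and `decode` are hypothetical.

```python
from fractions import Fraction

# name: (exponent bits, fraction bits). These widths are an assumption.
FORMATS = {"half": (5, 10), "single": (8, 23), "double": (11, 52), "quad": (15, 112)}

def bias(fmt):
    """All exponent bits set except the most significant one."""
    ebits, _ = FORMATS[fmt]
    return (1 << (ebits - 1)) - 1

def decode(bits, fmt):
    """Return (kind, sign bit, exact value or None) for a packed bit pattern."""
    ebits, fbits = FORMATS[fmt]
    s = (bits >> (ebits + fbits)) & 1
    e = (bits >> fbits) & ((1 << ebits) - 1)
    f = bits & ((1 << fbits) - 1)
    sign = -1 if s else 1
    if e == (1 << ebits) - 1:                      # exponent field of all one bits
        if f == 0:
            return "infinity", s, None
        msb = (f >> (fbits - 1)) & 1               # most significant fraction bit
        return ("quiet NaN" if msb else "signaling NaN"), s, None
    if e == 0:                                     # exponent field of all zero bits
        if f == 0:
            return "zero", s, Fraction(0)
        return "denormalized", s, sign * Fraction(2) ** (1 - bias(fmt)) * Fraction(f, 1 << fbits)
    return "normalized", s, sign * Fraction(2) ** (e - bias(fmt)) * (1 + Fraction(f, 1 << fbits))

print(decode(0x3C00, "half"))   # a normalized half-precision pattern
```

Exact rational arithmetic is used for the value so that the sketch does not depend on the precision of the host's floating-point type.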

Half-Precision Floating-Point

Zeus half precision uses a format similar to standard 754's requirements, reduced to a 16-bit overall format. The format contains sufficient precision and exponent range to hold a 12-bit signed integer.
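
The 12-bit claim can be checked empirically, assuming the Zeus half-precision layout matches the IEEE binary16 interchange format that Python's `struct` module exposes as format code `e`. That equivalence is an assumption of this sketch; the text above only says the format is "similar" to standard 754.

```python
import struct

def roundtrip_half(n):
    """Convert n to a 16-bit float and back, returning the recovered value."""
    return struct.unpack("<e", struct.pack("<e", float(n)))[0]

# Every 12-bit signed integer, -2048 through 2047, should survive unchanged.
exact = all(roundtrip_half(n) == n for n in range(-2048, 2048))
print(exact)
```

Under the same assumption, 2049 needs one more significand bit than the format provides and does not round-trip.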

Single Precision Floating-Point

Zeus single precision satisfies standard 754's requirements for “single.”

Double-Precision Floating-Point

Zeus double precision satisfies standard 754's requirements for “double.”

Quad-Precision Floating-Point

Zeus quad precision satisfies standard 754's requirements for “double extended,” but has additional fraction precision to use 128 bits.

Zeus Processor

MicroUnity's Zeus processor provides the general-purpose, high-bandwidth computation capability of the Zeus system. Zeus includes high-bandwidth data paths, register files, and a memory hierarchy. Zeus's memory hierarchy includes on-chip instruction and data memories, instruction and data caches, a virtual memory facility, and interfaces to external devices. Zeus's interfaces in the initial implementation are solely the “Super Socket 7” bus, but other implementations may have different or additional interfaces.

Architectural Framework

The Zeus architecture defines a compatible framework for a family of implementations with a range of capabilities. The following implementation-defined parameters are used in the rest of the document in boldface. The value indicated is for MicroUnity's first Zeus implementation.

Parameter  Interpretation                                     Value  Range of legal values
T          number of execution threads                        4      1 ≦ T ≦ 31
CE         log₂ cache blocks in first-level cache             9      0 ≦ CE ≦ 31
CS         log₂ cache blocks in first-level cache set         2      0 ≦ CS ≦ 4
CT         existence of dedicated tags in first-level cache   1      0 ≦ CT ≦ 1
LE         log₂ entries in local TB                           0      0 ≦ LE ≦ 3
LB         Local TB based on base register                    1      0 ≦ LB ≦ 1
GE         log₂ entries in global TB                          7      0 ≦ GE ≦ 15
GT         log₂ threads which share a global TB               1      0 ≦ GT ≦ 3

Interfaces and Block Diagram

The first implementation of Zeus uses “socket 7” protocols and pinouts.

Instruction Assembler Syntax

Instructions are specified to Zeus assemblers and other code tools (assemblers) in the syntax of an instruction mnemonic (operation code), then optionally white space (blanks or tabs) followed by a list of operands.

The instruction mnemonics listed in this specification are in upper case (capital) letters; assemblers accept either upper case or lower case letters in the instruction mnemonics. In this specification, instruction mnemonics contain periods (“.”) to separate elements to make them easier to understand; assemblers ignore periods within instruction mnemonics. The instruction mnemonics are designed to be parsed uniquely without the separating periods.

If the instruction produces a register result, this operand is listed first. Following this operand, if there are one or more source operands, is a separator which may be a comma (“,”), equal (“=”), or at-sign (“@”). The equal separates the result operand from the source operands, and may optionally be expressed as a comma in assembler code. The at-sign indicates that the result operand is also a source operand, and may optionally be expressed as a comma in assembler code. If the instruction specification has an equal-sign, an at-sign in assembler code indicates that the result operand should be repeated as the first source operand (for example, “A.ADD.I r4@5” is equivalent to “A.ADD.I r4=r4,5”). Commas always separate the remaining source operands.

The result and source operands are case-sensitive; upper case and lower case letters are distinct. Register operands are specified by the names r0 (or r00) through r63 (a lower case “r” immediately followed by a one or two digit number from 0 to 63), or by the special designations of “lp” for “r0,” “dp” for “r1,” “fp” for “r62,” and “sp” for “r63.” Integer-valued operands are specified by an optional sign (−) or (+) followed by a number, and assemblers generally accept a variety of integer-valued expressions.
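
A minimal sketch of how an assembler front end might normalize a source line under these rules. The function name `normalize` is hypothetical, and the sketch assumes the instruction is one whose specification uses an equal-sign, so that an at-sign in the source repeats the result operand as the first source operand. It does not model instructions specified with an at-sign, comma-for-equal substitution, register aliases, or expression parsing.

```python
def normalize(line):
    """Return (mnemonic, operands) with periods removed and '@' expanded."""
    parts = line.split(None, 1)                   # mnemonic, then optional operands
    mnemonic = parts[0].replace(".", "").upper()  # periods ignored, case-insensitive
    operands = "".join(parts[1].split()) if len(parts) > 1 else ""
    if "@" in operands:
        # Assumed equal-sign instruction: repeat the result as first source.
        result, sources = operands.split("@", 1)
        operands = f"{result}={result},{sources}"
    return mnemonic, operands

print(normalize("A.ADD.I r4@5"))
print(normalize("a.add.i r4=r4,5"))
```

Operands are left untouched apart from white space removal because, unlike mnemonics, they are case-sensitive.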

Instruction Structure

A Zeus instruction is specifically defined as a four-byte structure with the little-endian ordering shown below. It is different from the quadlet defined above because the placement of instructions into memory must be independent of the byte ordering used for data structures. Instructions must be aligned on four-byte boundaries; in the diagram below, i must be a multiple of 4.

Gateway

A Zeus gateway is specifically defined as an 8-byte structure with the little-endian ordering shown below. A gateway contains a code address used to securely invoke a system call or procedure at a higher privilege level. Gateways are marked by protection information specified in the TB. Gateways must be aligned on 8-byte boundaries; in the diagram below, i must be a multiple of 8.

The gateway contains two data items within its structure, a code address and a new privilege level:

The virtual memory system can be used to designate a region of memory as containing gateways. Other data may be placed within the gateway region, provided that if an attempt is made to use the additional data as a gateway, that security cannot be violated. For example, 64-bit data or stack pointers which are aligned to at least 4 bytes and are in little-endian byte order have pl=0, so that the privilege level cannot be raised by attempting to use the additional data as a gateway.

User State

The user state consists of hardware data structures that are accessible to all conventional compiled code. The Zeus user state is designed to be as regular as possible, and consists only of the general registers, the program counter, and virtual memory. There are no specialized registers for condition codes, operating modes, rounding modes, integer multiply/divide, or floating-point values.

General Registers

Zeus user state includes 64 general registers. All are identical; there is no dedicated zero-valued register, and there are no dedicated floating-point registers.

Some Zeus instructions have 64-bit register operands. These operands are sign-extended to 128 bits when written to the register file, and the low-order 64 bits are chosen when read from the register file.

Definition

def val ← RegRead(rn, size)
    case size of
        64:
            val ← REG[rn]_(63..0)
        128:
            val ← REG[rn]
    endcase
enddef

def RegWrite(rn, size, val)
    case size of
        64:
            REG[rn] ← val_(63)^(64) || val_(63..0)
        128:
            REG[rn] ← val_(127..0)
    endcase
enddef
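
For readers who prefer an executable form, the definition above can be transcribed roughly as follows. This is a sketch that models the register file as a Python list of integers; the names `REG`, `reg_read` and `reg_write` mirror the definition but are otherwise illustrative.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

REG = [0] * 64  # 64 general registers, each holding a 128-bit value

def reg_read(rn, size):
    """Read the low-order 64 bits, or all 128 bits, of register rn."""
    if size == 64:
        return REG[rn] & MASK64
    if size == 128:
        return REG[rn]
    raise ValueError("size must be 64 or 128")

def reg_write(rn, size, val):
    """Write val to register rn, sign-extending 64-bit values to 128 bits."""
    if size == 64:
        low = val & MASK64
        high = MASK64 if (low >> 63) & 1 else 0   # bit 63 repeated 64 times
        REG[rn] = (high << 64) | low
    elif size == 128:
        REG[rn] = val & MASK128
    else:
        raise ValueError("size must be 64 or 128")

reg_write(3, 64, 0x8000000000000001)
print(hex(REG[3]), hex(reg_read(3, 64)))
```

A 64-bit write of a value with bit 63 set therefore fills the upper half of the register with ones, while a 64-bit read always returns only the low-order half.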

Program Counter

The program counter contains the address of the currently executing instruction. This register is implicitly manipulated by branch instructions, and read by branch instructions that save a return address in a general register.

Privilege Level

The privilege level register contains the privilege level of the currently executing instruction. This register is implicitly manipulated by branch gateway and branch down instructions, and read by branch gateway instructions that save a return address in a general register.

Program Counter and Privilege Level

The program counter and privilege level may be packed into a single octlet. This combined data structure is saved by the Branch Gateway instruction and restored by the Branch Down instruction.

System State

The system state consists of the facilities not normally used by conventional compiled code. These facilities provide mechanisms to execute such code in a fully virtual environment. All system state is memory mapped, so that it can be manipulated by compiled code.

Fixed-Point

Zeus provides load and store instructions to move data between memory and the registers, branch instructions to compare the contents of registers and to transfer control from one code address to another, and arithmetic operations to perform computation on the contents of registers, returning the result to registers.

Load and Store

The load and store instructions move data between memory and the registers. When loading data from memory into a register, values are zero-extended or sign-extended to fill the register. When storing data from a register into memory, values are truncated on the left to fit the specified memory region.

Load and store instructions that specify a memory region of more than one byte may use either little-endian or big-endian byte ordering: the size and ordering are explicitly specified in the instruction. Regions larger than one byte may be either aligned to addresses that are an even multiple of the size of the region or of unspecified alignment: alignment checking is also explicitly specified in the instruction.

Load and store instructions specify memory addresses as the sum of a base general register and the product of the size of the memory region and either an immediate value or another general register. Scaling maximizes the memory space which can be reached by immediate offsets from a single base general register, and assists in generating memory addresses within iterative loops. Alignment of the address can be reduced to checking the alignment of the first general register.

The load and store instructions are used for fixed-point data as well as floating-point and digital signal processing data; Zeus has a single bank of registers for all data types.

Swap instructions provide multithread and multiprocessor synchronization, using indivisible operations: add-swap, compare-swap, multiplex-swap, and double-compare-swap. A store-multiplex operation provides the ability to indivisibly write to a portion of an octlet. These instructions always operate on aligned octlet data, using either little-endian or big-endian byte ordering.
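
The scaled address computation can be sketched as follows. The names `effective_address` and `is_aligned` are hypothetical, and wrapping the sum to 64 bits is an assumption of this sketch based on the octlet address size.

```python
MASK64 = (1 << 64) - 1

def effective_address(base, size, index):
    """base + size * index, where index is an immediate or a register value."""
    return (base + size * index) & MASK64   # 64-bit wraparound is assumed

def is_aligned(address, size):
    """True when address is a multiple of the region size."""
    return address % size == 0

addr = effective_address(0x1000, 8, 3)      # third octlet past the base
print(hex(addr), is_aligned(addr, 8))
```

Because `size * index` is always a multiple of `size`, the computed address is aligned exactly when the base register is aligned, which is why the alignment check can be reduced to the first general register.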

Branch

The fixed-point compare-and-branch instructions provide all arithmetic tests for equality and inequality of signed and unsigned fixed-point values. Tests are performed either between two operands contained in general registers, or on the bitwise and of two operands. Depending on the result of the compare, either a branch is taken, or not taken. A taken branch causes an immediate transfer of the program counter to the target of the branch, specified by a 12-bit signed offset from the location of the branch instruction. A non-taken branch causes no transfer; execution continues with the following instruction.

Other branch instructions provide for unconditional transfer of control to addresses too distant to be reached by a 12-bit offset, and to transfer to a target while placing the location following the branch into a register. The branch through gateway instruction provides a secure means to access code at a higher privilege level, in a form similar to a normal procedure call.

Addressing Operations

A subset of general fixed-point arithmetic operations is available as addressing operations. These include add, subtract, Boolean, and simple shift operations. These addressing operations may be performed at a point in the Zeus processor pipeline so that they may be completed prior to or in conjunction with the execution of load and store operations in a “superspring” pipeline in which other arithmetic operations are deferred until the completion of load and store operations.

Execution Operations

Many of the operations used for Digital Signal Processing (DSP), which are described in greater detail below, are also used for performing simple scalar operations. These operations perform arithmetic operations on values of 8-, 16-, 32-, 64-, or 128-bit sizes, which are right-aligned in registers. These execution operations include the add, subtract, boolean and simple shift operations which are also available as addressing operations, but further extend the available set to include three-operand add/subtract, three-operand boolean, dynamic shifts, and bit-field operations.

Floating-Point

Zeus provides all the facilities mandated and recommended by ANSI/IEEE standard 754-1985: Binary Floating-point Arithmetic, with the use of supporting software.

Branch Conditionally

The floating-point compare-and-branch instructions provide all the comparison types required and suggested by the IEEE floating-point standard. These floating-point comparisons augment the usual types of numeric value comparisons with special handling for NaN (not-a-number) values. A NaN value compares as “unordered” with respect to any other value, even that of an identical NaN value.

Zeus floating-point compare-branch instructions do not generate an exception on comparisons involving quiet or signaling NaN values. If such exceptions are desired, they can be obtained by combining the use of a floating-point compare-set instruction, with either a floating-point compare-branch instruction on the floating-point operands or a fixed-point compare-branch on the set result.

Because the less and greater relations are anti-commutative, one of each relation that differs from another only by the replacement of an L with a G in the code can be removed by reversing the order of the operands and using the other code. Thus, an L relation can be used in place of a G relation by swapping the operands to the compare-branch or compare-set instruction.

No instructions are provided that branch when the values are unordered. To accomplish such an operation, use the reverse condition to branch over an immediately following unconditional branch, or in the case of an if-then-else clause, reverse the clauses and use the reverse condition.

The E relation can be used to determine the unordered condition of a single operand by comparing the operand with itself.

The following floating-point compare-branch relations are provided as instructions:

compare-branch relations

                         Branch taken if values compare as:   Exception if
Mnemonic code  C-like    Unordered  Greater  Less  Equal      unordered  invalid
E              ==        F          F        F     T          no         no
LG             <>        F          T        T     F          no         no
L              <         F          F        T     F          no         no
GE             >=        F          T        F     T          no         no
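
The four compare-branch relations can be modelled as sets of comparison outcomes. The following is a sketch using Python floats; the names `compare`, `RELATIONS` and `branch_taken` are hypothetical and the exception columns are not modelled.

```python
import math

def compare(a, b):
    """Classify a pair of values as unordered, greater, less or equal."""
    if math.isnan(a) or math.isnan(b):
        return "unordered"
    if a > b:
        return "greater"
    if a < b:
        return "less"
    return "equal"

# Outcomes for which each relation takes the branch.
RELATIONS = {
    "E": {"equal"},
    "LG": {"greater", "less"},
    "L": {"less"},
    "GE": {"greater", "equal"},
}

def branch_taken(code, a, b):
    return compare(a, b) in RELATIONS[code]

nan = float("nan")
# E fails when an operand is compared with itself only if it is a NaN.
print(branch_taken("E", nan, nan), branch_taken("E", 1.0, 1.0))
# A "greater" test is obtained from L by swapping the operands.
print(branch_taken("L", 2.0, 3.0) == (compare(3.0, 2.0) == "greater"))
```

No relation contains "unordered", which matches the statement that no instruction branches when the values are unordered.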

Compare-Set

The compare-set floating-point instructions provide all the comparison types supported as branch instructions. Zeus compare-set floating-point instructions may optionally generate an exception on comparisons involving quiet or signaling NaNs.

The following floating-point compare-set relations are provided as instructions:

compare-set relations

                         Result if values compare as:         Exception if
Mnemonic code  C-like    Unordered  Greater  Less  Equal      unordered  invalid
E              ==        F          F        F     T          no         no
LG             <>        F          T        T     F          no         no
L              <         F          F        T     F          no         no
GE             >=        F          T        F     T          no         no
E.X            ==        F          F        F     T          no         yes
LG.X           <>        F          T        T     F          no         yes
L.X            <         F          F        T     F          yes        yes
GE.X           >=        F          T        F     T          yes        yes

Arithmetic Operations

The basic operations supported in hardware are floating-point add, subtract, multiply, divide, square root and conversions among floating-point formats and between floating-point and binary integer formats.

Software libraries provide other operations required by the ANSI/IEEE floating-point standard.

The operations explicitly specify the precision of the operation, and round the result (or check that the result is exact) to the specified precision at the conclusion of each operation. Each of the basic operations splits operand registers into symbols of the specified precision and performs the same operation on corresponding symbols.

In addition to the basic operations, Zeus performs a variety of operations in which one or more products are summed to each other and/or to an additional operand. The instructions include a fused multiply-add (E.MUL.ADD.F), convolve (E.CON.F), matrix multiply (E.MUL.MAT.F), and scale-add (E.SCAL.ADD.F).

The results of these operations are computed as if the multiplies are performed to infinite precision, added as if in infinite precision, then rounded only once. Consequently, these operations are performed with no rounding of intermediate results that would have limited the accuracy of the result.
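
As an illustration of splitting operand registers into symbols, the sketch below treats a 128-bit register as a 16-byte image holding four single-precision symbols and adds corresponding symbols. The function name `group_add_f32` is hypothetical, little-endian symbol order is assumed, and exception handling and rounding controls are not modelled.

```python
import struct

def group_add_f32(ra, rb):
    """Add four 32-bit floating-point symbols packed in two 16-byte registers."""
    a = struct.unpack("<4f", ra)            # split each register into symbols
    b = struct.unpack("<4f", rb)
    sums = [x + y for x, y in zip(a, b)]    # same operation on each symbol pair
    return struct.pack("<4f", *sums)        # catenate the individual results

ra = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
rb = struct.pack("<4f", 0.5, 0.5, 0.5, 0.5)
print(struct.unpack("<4f", group_add_f32(ra, rb)))
```

The catenated result occupies the same 16 bytes as each operand, so it can be written back to a single register.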
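
The effect of rounding only once can be demonstrated with exact rational arithmetic. The sketch below works in double precision because that is what Python floats provide; it is not a model of E.MUL.ADD.F itself, and the function names are hypothetical.

```python
from fractions import Fraction

def fused_multiply_add(a, b, c):
    """Compute a*b + c exactly, then round once to the nearest double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def separate_multiply_add(a, b, c):
    """Round the product to a double first, then round the sum."""
    return a * b + c

a = 1.0 + 2.0 ** -52
b = 1.0 - 2.0 ** -52
c = -1.0

# The exact product is 1 - 2**-104, which rounds to 1.0 when rounded alone.
print(separate_multiply_add(a, b, c), fused_multiply_add(a, b, c))
```

With the intermediate rounding the small term is lost entirely, whereas the single-rounding result retains it.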

NaN Handling

ANSI/IEEE standard 754-1985 specifies that operations involving a signaling NaN or invalid operation shall, if no trap occurs and if a floating-point result is to be delivered, deliver a quiet NaN as its result. However, it fails to specify what quiet NaN value to deliver.

Zeus operations that produce a floating-point result and do not trap on invalid operations propagate signaling NaN values from operands to results, changing the signaling NaN values to quiet NaN values by setting the most significant fraction bit and leaving the remaining bits unchanged. Other causes of invalid operations produce the default quiet NaN value, where the sign bit is zero, the exponent field is all one bits, the most significant fraction bit is set and the remaining fraction bits are zero bits. For Zeus operations that produce multiple results catenated together, signaling NaN propagation or quiet NaN production is handled separately and independently for each result symbol.

ANSI/IEEE standard 754-1985 specifies that quiet NaN values should be propagated from operand to result by the basic operations. However, it fails to specify which of several quiet NaN values to propagate when more than one operand is a quiet NaN. In addition, the standard does not clearly specify how quiet NaN should be propagated for the multiple-operation instructions provided in Zeus. The standard does not specify the quiet NaN produced as a result of an operand being a signaling NaN when invalid operation exceptions are handled by default. The standard leaves unspecified how quiet and signaling NaN values are propagated through format conversions and the absolute-value, negate and copy operations. This section specifies these aspects left unspecified by the standard.
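
A minimal sketch of the quieting rule on single-precision bit patterns, assuming the usual 1/8/23 layout of sign, exponent and fraction fields. The names `is_signaling_nan`, `quiet` and `DEFAULT_QNAN` are hypothetical.

```python
EXP_MASK = 0x7F800000        # exponent field, all one bits for NaN and infinity
FRAC_MASK = 0x007FFFFF       # 23-bit fraction field
QUIET_BIT = 0x00400000       # most significant fraction bit

DEFAULT_QNAN = 0x7FC00000    # sign 0, exponent all ones, only the quiet bit set

def is_signaling_nan(bits):
    """Exponent all ones, non-zero fraction, most significant fraction bit clear."""
    return ((bits & EXP_MASK) == EXP_MASK
            and (bits & FRAC_MASK) != 0
            and (bits & QUIET_BIT) == 0)

def quiet(bits):
    """Set the most significant fraction bit, leaving all other bits unchanged."""
    return bits | QUIET_BIT

snan = 0xFF800123
print(hex(quiet(snan)), is_signaling_nan(snan), is_signaling_nan(quiet(snan)))
```

The sign and the low-order fraction bits of the signaling NaN survive the conversion, so any payload carried in those bits is preserved in the quiet result.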

First of all, for Zeus operations that produce multiple results catenated together, quiet and signaling NaN propagation is handled separately and independently for each result symbol. A quiet or signaling NaN value in a single symbol of an operand causes only those result symbols that are dependent on that operand symbol's value to be propagated as that quiet NaN. Multiple quiet or signaling NaN values in symbols of an operand which influence separate symbols of the result are propagated independently of each other. Any signaling NaN that is propagated has the high-order fraction bit set to convert it to a quiet NaN.

For Zeus operations in which multiple symbols among operands upon which a result symbol is dependent are quiet or signaling NaNs, a priority rule will determine which NaN is propagated. Priority shall be given to the operand that is specified by a register definition at a lower-numbered (little-endian) bit position within the instruction (rb has priority over rc, which has priority over rd). In the case of operands which are catenated from two registers, priority shall be assigned based on the register which has highest priority (lower-numbered bit position within the instruction). In the case of a tie (as when the E.SCAL.ADD scaling operand has two corresponding NaN values, or when an E.MUL.CF operand has NaN values for both real and imaginary components of a value), the value which is located at a lower-numbered (little-endian) bit position within the operand is to receive priority. The identification of a NaN as quiet or signaling shall not confer any priority for selection; only the operand position does, though a signaling NaN will cause an invalid operand exception.

The sign bit of NaN values propagated shall be complemented if the instruction subtracts or negates the corresponding operand or (but not and) multiplies it by or divides it by or divides it into an operand which has the sign bit set, even if that operand is another NaN. If a NaN is both subtracted and multiplied by a negative value, the sign bit shall be propagated unchanged.

For Zeus operations that convert between two floating-point formats (INFLATE and DEFLATE), NaN values are propagated by preserving the sign and the most-significant fraction bits, except that the most-significant bit of a signaling NaN is set and (for DEFLATE) the least-significant fraction bit preserved is combined, via a logical-or, with all fraction bits not preserved. All additional fraction bits (for INFLATE) are set to zero.

For Zeus operations that convert from a floating-point format to a fixed-point format (SINK), NaN values produce zero values (maximum-likelihood estimate). Infinity values produce the largest representable positive or negative fixed-point value that fits in the destination field. When exception traps are enabled, NaN or Infinity values produce a floating-point exception. Underflows do not occur in the SINK operation; they produce −1, 0 or +1, depending on rounding controls.

For absolute-value, negate, or copy operations, NaN values are propagated with the sign bit cleared, complemented, or copied, respectively. Signaling NaN values cause the Invalid operation exception, propagating a quieted NaN in corresponding symbol locations (default) or an exception, as specified by the instruction.
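
One reading of this rule, sketched for NaN values converted between half and single precision. The 1/5/10 and 1/8/23 field layouts are assumed, the function names are hypothetical, and only NaN inputs are handled; ordinary numbers follow the normal conversion and rounding path, which is not shown.

```python
def inflate_nan_half_to_single(h):
    """Widen a half-precision NaN: keep sign and fraction, append zero bits."""
    sign = (h >> 15) & 1
    frac = h & 0x3FF                 # 10 fraction bits
    frac |= 0x200                    # set the most significant bit (quiets an SNaN)
    return (sign << 31) | (0xFF << 23) | (frac << 13)   # 13 new low bits are zero

def deflate_nan_single_to_half(x):
    """Narrow a single-precision NaN: keep sign and the top 10 fraction bits."""
    sign = (x >> 31) & 1
    frac = x & 0x7FFFFF              # 23 fraction bits
    kept = frac >> 13                # most-significant fraction bits preserved
    if frac & 0x1FFF:                # any of the 13 discarded bits set?
        kept |= 1                    # OR them into the least-significant kept bit
    kept |= 0x200                    # set the most significant bit (quiets an SNaN)
    return (sign << 15) | (0x1F << 10) | kept

print(hex(inflate_nan_half_to_single(0x7D01)))
print(hex(deflate_nan_single_to_half(0x7F800001)))
```

The logical-or step means that discarded payload bits leave a trace in the least-significant preserved bit instead of vanishing; here it is applied independently of the quiet bit, which is set in every result.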
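
The special-value cases of this conversion can be sketched as follows. The function name `sink` is hypothetical, a signed two's complement destination is assumed, round-to-nearest-even stands in for whichever rounding control the instruction selects, and exception traps are not modelled. The passage does not describe finite values that exceed the destination range, so the sketch rejects them instead of guessing.

```python
import math

def sink(x, bits):
    """Convert a float to a signed fixed-point integer of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if math.isnan(x):
        return 0                      # NaN values produce zero
    if math.isinf(x):
        return hi if x > 0 else lo    # largest representable value of that sign
    n = round(x)                      # assumed rounding control: nearest, ties to even
    if not lo <= n <= hi:
        raise OverflowError("finite overflow is not described in the passage")
    return n

print(sink(float("nan"), 16), sink(float("inf"), 16), sink(float("-inf"), 16))
```

Under round-to-nearest a very small magnitude such as 1e-30 converts to 0, consistent with the statement that underflow does not occur in this operation.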

Floating-Point functions

Referring to FIG. 37, the following functions are defined for use within the detailed instruction definitions in the following section. In these functions an internal format represents infinite-precision floating-point values as a four-element structure consisting of (1) s (sign bit): 0 for positive, 1 for negative, (2) t (type): NORM, ZERO, SNAN, QNAN, INFINITY, (3) e (exponent), and (4) f (fraction). The mathematical interpretation of a normal value places the binary point at the units of the fraction, adjusted by the exponent: (−1)^s * (2^e) * f. The function F converts a packed IEEE floating-point value into internal format. The function PackF converts an internal format back into IEEE floating-point format, with rounding and exception control.

Digital Signal Processing

The Zeus processor provides a set of operations that maintain the fullest possible use of 128-bit data paths when operating on lower-precision fixed-point or floating-point vector values. These operations are useful for several application areas, including digital signal processing, image processing and synthetic graphics. The basic goal of these operations is to accelerate the performance of algorithms that exhibit the following characteristics:

Low-Precision Arithmetic

The operands and intermediate results are fixed-point values representedin no greater than 64 bit precision. For floating-point arithmetic,operands and intermediate results are of 16, 32, or 64 bit precision.

The fixed-point arithmetic operations include add, subtract, multiply,divide, shifts, and set on compare.

The use of fixed-point arithmetic permits various forms of operationreordering that are not permitted in floating-point arithmetic.Specifically, commutativity and associativity, and distributionidentities can be used to reorder operations. Compilers can evaluateoperations to determine what intermediate precision is required to getthe specified arithmetic result.

Zeus supports several levels of precision, as well as operations to convert between these different levels. These precision levels are always powers of two, and are explicitly specified in the operation code.

When specified, add, subtract, and shift operations may cause a fixed-point arithmetic exception to occur on resulting conditions such as signed or unsigned overflow. The fixed-point arithmetic exception may also be invoked upon a signed or unsigned comparison.

Sequential Access to Data

The algorithms are or can be expressed as operations on sequentially ordered items in memory. Scatter-gather memory access or sparse-matrix techniques are not required.

Where an index variable is used with a multiplier, such multipliers must be powers of two. When the index is of the form nx+k, the value of n must be a power of two, and the values referenced should have k include the majority of values in the range 0..n−1. A negative multiplier may also be used.

Vectorizable Operations

The operations performed on these sequentially ordered items are identical and independent. Conditional operations are either rewritten to use Boolean variables or masking, or the compiler is permitted to convert the code into such a form.

Data-Handling Operations

The characteristics of these algorithms include sequential access to data, which permits the use of the normal load and store operations to reference the data. Octlet and hexlet loads and stores reference several sequential items of data, the number depending on the operand precision.

The discussion of these operations is independent of byte ordering, though the ordering of bit fields within octlets and hexlets must be consistent with the ordering used for bytes. Specifically, if big-endian byte ordering is used for the loads and stores, the figures below should assume that index values increase from left to right, and for little-endian byte ordering, the index values increase from right to left. For this reason, the figures indicate different index values with different shades, rather than numbering.

When an index of the nx+k form is used in array operands, where n is a power of 2, data memory sequentially loaded contains elements useful for separate operands. The “shuffle” instruction divides a triclet of data up into two hexlets, with alternate bit fields of the source triclet grouped together into the two results. An immediate field, h, in the instruction specifies which of the two regrouped hexlets to select for the result. For example, two X.SHUFFLE.256 rd=rc,rb,32,128,h operations rearrange the source triclet (c,b) into two hexlets as in FIG. 54A.

In the shuffle operation, two hexlet registers specify the source triclet, and one of the two result hexlets is specified as a hexlet register.

The example above directly applies to the case where n is 2. When n is larger, shuffle operations can be used to further subdivide the sequential stream. For example, when n is 4, we need to deal out 4 sets of doublet operands, as shown in FIG. 54B. (An example of the use of a four-way deal is a digital signal processing application such as conversion of color to monochrome.)

When an array result of computation is accessed with an index of the form nx+k, for n a power of 2, the reverse of the “deal” operation needs to be performed on vectors of results to interleave them for storage in sequential order. The “shuffle” operation interleaves the bit fields of two octlets of results into a single hexlet. For example, an X.SHUFFLE.16 operation combines two octlets of doublet fields into a hexlet as shown in FIG. 54C.

For larger values of n, a series of shuffle operations can be used to combine additional sets of fields, similarly to the mechanism used for the deal operations. For example, when n is 4, we need to shuffle up 4 sets of doublet operands, as shown in FIG. 54D. (An example of the use of a four-way shuffle is a digital signal processing application such as conversion of monochrome to color.)
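The n-way deal and its inverse, the n-way interleave, may be sketched at the level of whole bit fields as follows. This Python sketch is illustrative only: the function names deal and interleave are hypothetical, fields are modelled as list elements rather than as bits of a hexlet, and the sketch does not attempt to reproduce the operand encoding of the X.SHUFFLE instruction.

```python
def deal(fields, n):
    """Split a sequential stream into n groups; group k holds the
    fields whose index has the form n*x + k."""
    return [fields[k::n] for k in range(n)]

def interleave(groups):
    """Inverse of deal: merge n equal-length groups back into
    sequential order."""
    out = []
    for row in zip(*groups):
        out.extend(row)
    return out
```

A two-way deal of eight fields separates the even-indexed fields from the odd-indexed fields, and interleaving the resulting groups restores the original order.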

When the index of a source array operand or a destination array result is negated, or in other words, is of the form nx+k where n is negative, the elements of the array must be arranged in reverse order. The “swizzle” operation can reverse the order of the bit fields in a hexlet. For example, an X.SWIZZLE rd=rc,127,112 operation reverses the doublets within a hexlet as shown in FIG. 47C.

In some cases, it is desirable to use a group instruction in which one or more operands is a single value, not an array. The “swizzle” operation can also copy operands to multiple locations within a hexlet. For example, an X.SWIZZLE 15,0 operation copies the low-order 16 bits to each doublet within a hexlet.
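One bit-index mapping that is consistent with both swizzle examples above may be sketched in Python as follows. The mapping, in which result bit i is taken from source bit (i AND icopy) XOR iswap, is an assumption adopted for this sketch, as are the function and parameter names; the sketch is not a statement of the instruction definition.

```python
def swizzle(c, icopy, iswap, width=128):
    """Rearrange the bits of c. Assumed mapping: result bit i comes
    from source bit (i & icopy) ^ iswap."""
    r = 0
    for i in range(width):
        src = (i & icopy) ^ iswap
        r |= ((c >> src) & 1) << i
    return r

def pack_doublets(doublets):
    """Helper: place 16-bit values into a hexlet, lowest field first."""
    x = 0
    for k, d in enumerate(doublets):
        x |= (d & 0xFFFF) << (16 * k)
    return x
```

Under this assumption, the operands 127,112 complement index bits 4 through 6 and so reverse the eight doublets, while the operands 15,0 discard the upper index bits and so replicate the low-order doublet.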

Variations of the deal and shuffle operations are also useful for converting from one precision to another. This may be required if one operand is represented in a different precision than another operand or the result, or if computation must be performed with intermediate precision greater than that of the operands, such as when using an integer multiply.

When converting from a higher precision to a lower precision, specifically when halving the precision of a hexlet of bit fields, half of the data must be discarded, and the bit fields packed together. The “compress” operation is a variant of the “deal” operation, in which the operand is a hexlet, and the result is an octlet. An arbitrary half-sized sub-field of each bit field can be selected to appear in the result. For example, a selection of bits 19..4 of each quadlet in a hexlet is performed by the X.COMPRESS rd=rc,16,4 operation as shown in FIG. 43D.
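The sub-field selection performed by the compress operation may be sketched in Python as follows. The function name compress and its parameter names are hypothetical, and the hexlet and octlet are modelled as non-negative integers; this is a sketch of the bit selection described above, not of any rounding or saturation behavior.

```python
def compress(c, out_bits, shift, width=128):
    """Halve the precision of each field of c: from every field of
    2*out_bits bits, keep bits shift+out_bits-1..shift and pack the
    selected sub-fields together."""
    in_bits = 2 * out_bits
    in_mask = (1 << in_bits) - 1
    out_mask = (1 << out_bits) - 1
    r = 0
    for k in range(width // in_bits):
        field = (c >> (k * in_bits)) & in_mask
        r |= ((field >> shift) & out_mask) << (k * out_bits)
    return r
```

With out_bits of 16 and a shift of 4, each quadlet contributes its bits 19..4, so the quadlet 0x000ABCD0 yields the doublet 0xABCD.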

When converting from lower-precision to higher-precision, specifically when doubling the precision of an octlet of bit fields, one of several techniques can be used, either multiply, expand, or shuffle. Each has certain useful properties. In the discussion below, m is the precision of the source operand.

The multiply operation, described in detail below, automatically doubles the precision of the result, so multiplication by a constant vector will simultaneously double the precision of the operand and multiply by a constant that can be represented in m bits.

An operand can be doubled in precision and shifted left with the “expand” operation, which is essentially the reverse of the “compress” operation. For example, the X.EXPAND rd=rc,16,4 operation expands from 16 bits to 32 and shifts 4 bits left, as shown in FIG. 54E.
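The expand operation may be sketched in Python as follows. The function name expand and its parameters are hypothetical, and the sketch assumes unsigned (zero-extending) treatment of each field; the handling of signed fields is not addressed here.

```python
def expand(c, in_bits, shift, width=64):
    """Double the precision of each in_bits-wide field of c and shift
    it left by shift bits within its new field.
    Assumption: fields are unsigned, so the extension is with zeros."""
    out_bits = 2 * in_bits
    in_mask = (1 << in_bits) - 1
    out_mask = (1 << out_bits) - 1
    r = 0
    for k in range(width // in_bits):
        field = (c >> (k * in_bits)) & in_mask
        r |= ((field << shift) & out_mask) << (k * out_bits)
    return r
```

With in_bits of 16 and a shift of 4, the doublet 0xABCD becomes the quadlet 0x000ABCD0.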

The “shuffle” operation can double the precision of an operand and multiply it by 1 (unsigned only), 2^(m) or 2^(m)+1, by specifying the sources of the shuffle operation to be a zeroed register and the source operand, the source operand and zero, or both to be the source operand. When multiplying by 2^(m), a constant can be freely added to the source operand by specifying the constant as the right operand to the shuffle.
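The arithmetic effect of these shuffle forms may be checked with a short Python sketch. The function name shuffle_pairs is hypothetical, fields are modelled as list elements, and the sketch assumes that the left operand supplies the high-order half of each double-precision field and the right operand supplies the low-order half.

```python
def shuffle_pairs(left, right, m):
    """Interleave m-bit fields so that each result field of 2*m bits
    equals left_field * 2**m + right_field."""
    return [(l << m) | r for l, r in zip(left, right)]

m = 16
x = [3, 0xFFFF]        # unsigned m-bit source fields
zero = [0, 0]
const = [5, 5]         # an m-bit constant to be added

times_one = shuffle_pairs(zero, x, m)       # x * 1
times_2m = shuffle_pairs(x, zero, m)        # x * 2**m
times_2m_plus_1 = shuffle_pairs(x, x, m)    # x * (2**m + 1)
plus_const = shuffle_pairs(x, const, m)     # x * 2**m + const
```

Because the two halves of each result field do not overlap, placing a value in the low-order half is equivalent to adding it, which is why the constant can be added at no additional cost.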

Arithmetic Operations

The characteristics of the algorithms that affect the arithmetic operations most directly are low-precision arithmetic, and vectorizable operations. The fixed-point arithmetic operations provided are most of the functions provided in the standard integer unit, except for those that check conditions. These functions include add, subtract, bitwise Boolean operations, shift, set on condition, and multiply, in forms that take packed sets of bit fields of a specified size as operands. The floating-point arithmetic operations provided are as complete as the scalar floating-point arithmetic set. The result is generally a packed set of bit fields of the same size as the operands, except that the fixed-point multiply function intrinsically doubles the precision of the bit field.

Conditional operations are provided only in the sense that the set on condition operations can be used to construct bit masks that can select between alternate vector expressions, using the bitwise Boolean operations. All instructions operate over the entire octlet or hexlet operands, and produce a hexlet result. The sizes of the bit fields supported are always powers of two.
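The use of a set-on-condition mask to select between alternate vector expressions may be sketched in Python as follows. All function names are hypothetical, hexlets are modelled as non-negative integers, and the comparison shown is an unsigned greater-than chosen for illustration.

```python
WIDTH = 128

def pack(fields, size):
    """Catenate fields of size bits into one value, lowest field first."""
    r = 0
    for k, v in enumerate(fields):
        r |= (v & ((1 << size) - 1)) << (k * size)
    return r

def unpack(x, size):
    """Partition a value into WIDTH // size fields of size bits."""
    return [(x >> (k * size)) & ((1 << size) - 1) for k in range(WIDTH // size)]

def group_set_greater(a, b, size):
    """Each result field is all ones where a > b (unsigned), else zero."""
    ones = (1 << size) - 1
    return pack([ones if x > y else 0
                 for x, y in zip(unpack(a, size), unpack(b, size))], size)

def select(mask, a, b):
    """Bitwise Boolean select: a where mask is set, b elsewhere."""
    return (mask & a) | (~mask & b & ((1 << WIDTH) - 1))
```

For example, select(group_set_greater(a, b, 16), a, b) computes the field-wise unsigned maximum of a and b without any branch.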

Galois Field Operations

Zeus provides a general software solution to the most common operations required for Galois Field arithmetic. The instructions provided include a polynomial multiply, with the polynomial specified as one register operand. This instruction can be used to perform CRC generation and checking, Reed-Solomon code generation and checking, and spread-spectrum encoding and decoding.
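The arithmetic underlying a polynomial multiply over GF(2), followed by reduction modulo a specified polynomial, may be sketched in Python as follows. The function names are hypothetical and the sketch operates on single values rather than on packed groups; it illustrates the mathematics only and is not a definition of the instruction.

```python
def poly_mul(a, b):
    """Carry-less multiply: the product of two polynomials over GF(2),
    with bit i of each integer holding the coefficient of x**i."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def poly_mod(p, poly):
    """Remainder of p divided by poly over GF(2)."""
    deg = poly.bit_length() - 1
    while p.bit_length() - 1 >= deg:
        p ^= poly << (p.bit_length() - 1 - deg)
    return p

def gf_mul(a, b, poly):
    """Multiply in the Galois field defined by the polynomial poly."""
    return poly_mod(poly_mul(a, b), poly)
```

With the degree-8 polynomial 0x11B, gf_mul(0x57, 0x83, 0x11B) evaluates to 0xC1. The same poly_mod step, applied to a message polynomial with a generator polynomial, yields a CRC remainder.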

Software Conventions

The following section describes software conventions that are to be employed at software module boundaries, in order to permit the combination of separately compiled code and to provide standard interfaces between application, library and system software. Register usage and procedure call conventions may be modified, simplified or optimized when a single compilation encloses procedures within a compilation unit so that the procedures have no external interfaces. For example, internal procedures may permit a greater number of register-passed parameters, or have registers allocated to avoid the need to save registers at procedure boundaries, or may use a single stack or data pointer allocation to suffice for more than one level of procedure call.

Register Usage

All Zeus registers are identical and general-purpose; there is no dedicated zero-valued register, and no dedicated floating-point registers. However, some procedure-call-oriented instructions imply usage of registers zero (0) and one (1) in a manner consistent with the conventions described below. By software convention, the non-specific general registers are used in more specific ways.

register usage

register number   assembler names   usage           how saved
0                 lp, r0            link pointer    caller
1                 dp, r1            data pointer    caller
2-9               r2-r9             parameters      caller
10-31             r10-r31           temporary       caller
32-61             r32-r61           saved           callee
62                fp, r62           frame pointer   callee
63                sp, r63           stack pointer   callee

At a procedure call boundary, registers are saved either by the caller or callee procedure, which provides a mechanism for leaf procedures to avoid needing to save registers. Compilers may choose to allocate variables into caller or callee saved registers depending on how their lifetimes overlap with procedure calls.

Procedure Calling Conventions

Procedure parameters are normally allocated in registers, starting from register 2 up to register 9. These registers hold up to 8 parameters, which may each be of any size from one byte to sixteen bytes (hexlet), including floating-point and small structure parameters. Additional parameters are passed in memory, allocated on the stack. For C procedures which use varargs.h or stdarg.h and pass parameters to further procedures, the compilers must leave room in the stack memory allocation to save registers 2 through 9 into memory contiguously with the additional stack memory parameters, so that procedures such as _doprnt can refer to the parameters as an array.

Procedure return values are also allocated in registers, starting from register 2 up to register 9. Larger values are passed in memory, allocated on the stack.

There are several pointers maintained in registers for the procedure calling conventions: lp, sp, dp, fp.

The lp register contains the address to which the callee should return at the conclusion of the procedure. If the procedure is also a caller, the lp register will need to be saved on the stack, once, before any procedure call, and restored, once, after all procedure calls. The procedure returns with a branch instruction, specifying the lp register.

The sp register is used to form addresses to save parameter and other registers and to maintain local variables, i.e., data that is allocated as a LIFO stack. For procedures that require a stack, normally a single allocation is performed, which allocates space for input parameters, local variables, saved registers, and output parameters all at once. The sp register is always hexlet aligned.

The dp register is used to address pointers, literals and static variables for the procedure. The dp register points to a small (approximately 4096-entry) array of pointers, literals, and statically-allocated variables, which is used locally to the procedure. The uses of the dp register are similar to the use of the gp register on a Mips R-series processor, except that each procedure may have a different value, which expands the space addressable by small offsets from this pointer. This is an important distinction, as the offset field of Zeus load and store instructions is only 12 bits. The compiler may use additional registers and/or indirect pointers to address larger regions for a single procedure. The compiler may also share a single dp register value between procedures which are compiled as a single unit (including procedures which are externally callable), eliminating the need to save, modify and restore the dp register for calls between procedures which share the same dp register value.

Load- and store-immediate-aligned instructions, specifying the dp register as the base register, are generally used to obtain values from the dp region. These instructions shift the immediate value by the logarithm of the size of the operand, so loads and stores of large operands may reach farther from the dp register than of small operands. Referring to FIG. 54F, the size of the addressable region is maximized if the elements to be placed in the dp region are sorted according to size, with the smallest elements placed closest to the dp base. At points where the size changes, appropriate padding is added to keep elements aligned to memory boundaries matching the size of the elements. Using this technique, the maximum size of the dp region is always at least 4096 items, and may be larger when the dp area is composed of a mixture of data sizes.
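The sorted layout of the dp region may be sketched in Python as follows. The function name layout_dp_region is hypothetical, and the sketch assumes for simplicity that the 12-bit immediate is treated as an unsigned count of operand-sized units from the dp base.

```python
def layout_dp_region(items):
    """Place (name, size_in_bytes) items in a dp region, smallest first.

    Returns {name: (byte_offset, immediate)}, where the immediate is the
    offset divided by the operand size, as encoded in an aligned load.
    Assumption: the immediate is an unsigned 12-bit field.
    """
    placed = {}
    offset = 0
    for name, size in sorted(items, key=lambda item: item[1]):
        offset = (offset + size - 1) & ~(size - 1)   # pad up to alignment
        immediate = offset // size
        if immediate >= 4096:
            raise ValueError(name + " is out of reach of a 12-bit immediate")
        placed[name] = (offset, immediate)
        offset += size
    return placed
```

Because the immediate counts operand-sized units, an octlet placed at byte offset 8 is reached with an immediate of 1, and the 4096 available immediates span 32768 bytes for octlets but only 4096 bytes for bytes.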

The dp register mechanism also permits code to be shared, with each static instance of the dp region assigned to a different address in memory. In conjunction with position-independent or pc-relative branches, this allows library code to be dynamically relocated and shared between processes.

To implement an inter-module (separately compiled) procedure call, the lp register is loaded with the entry point of the procedure, and the dp register is loaded with the value of the dp register required for the procedure. These two values are located adjacent to each other as a pair of octlet quantities in the dp region for the calling procedure. For a statically-linked inter-module procedure call, the linker fills in the values at link time. However, this mechanism also provides for dynamic linking, by initially filling in the lp and dp fields in the data structure to invoke the dynamic linker. The dynamic linker can use the contents of the lp and/or dp registers to determine the identity of the caller and callee, to find the location to fill in the pointers and resume execution. Specifically, the lp value is initially set to point to an entry point in the dynamic linker, and the dp value is set to point to itself: the location of the lp and dp values in the dp region of the calling procedure. The identity of the procedure can be discovered from a string following the dp pointer, or a separate table, indexed by the dp pointer.

The fp register is used to address the stack frame when the stack size varies during execution of a procedure, such as when using the GNU C alloca function. When the stack size can be determined at compile time, the sp register is used to address the stack frame and the fp register may be used for any other general purpose as a callee-saved register.

Typical Static-Linked, Intra-Module Calling Sequence

caller (non-leaf):
caller:   A.ADDI    sp@-size      // allocate caller stack frame
          S.I.64.A  lp,sp,off     // save original lp register
          ... (callee using same dp as caller)
          B.LINK.I  callee
          ...
          ... (callee using same dp as caller)
          B.LINK.I  callee
          ...
          L.I.64.A  lp=sp,off     // restore original lp register
          A.ADDI    sp@size       // deallocate caller stack frame
          B         lp            // return

callee (leaf):
callee:   ... (code using dp)
          B         lp            // return

Procedures that are compiled together may share a common data region, in which case there is no need to save, load, and restore the dp region in the callee, assuming that the callee does not modify the dp register. The pc-relative addressing of the B.LINK.I instruction permits the code region to be position-independent.

Minimum Static-Linked, Intra-Module Calling Sequence

caller (non-leaf):
caller:   A.COPY    r31=lp        // save original lp register
          ... (callee using same dp as caller)
          B.LINK.I  callee
          ...
          ... (callee using same dp as caller)
          B.LINK.I  callee
          ...
          B         r31           // return

callee (leaf):
callee:   ... (code using dp, r31 unused)
          B         lp            // return

When all the callee procedures are intra-module, the stack frame may also be eliminated from the caller procedure by using “temporary” caller-save registers not utilized by the callee leaf procedures. In addition to the lp value indicated above, this usage may include other values and variables that live in the caller procedure across callee procedure calls.

Typical Dynamic-Linked, Inter-Module Calling Sequence

caller (non-leaf):
caller:   A.ADDI    sp@-size      // allocate caller stack frame
          S.I.64.A  lp,sp,off     // save original lp register
          S.I.64.A  dp,sp,off     // save original dp register
          ... (code using dp)
          L.I.64.A  lp=dp,off     // load lp
          L.I.64.A  dp=dp,off     // load dp
          B.LINK    lp=lp         // invoke callee procedure
          L.I.64.A  dp=sp,off     // restore dp register from stack
          ... (code using dp)
          L.I.64.A  lp=sp,off     // restore original lp register
          A.ADDI    sp@size       // deallocate caller stack frame
          B         lp            // return

callee (leaf):
callee:   ... (code using dp)
          B         lp            // return

The load instruction is required in the caller following the procedure call to restore the dp register. A second load instruction also restores the lp register, which may be located at any point between the last procedure call and the branch instruction which returns from the procedure.

System and Privileged Library Calls

It is an objective to make calls to system facilities and privileged libraries as similar as possible to normal procedure calls as described above. Rather than invoke system calls as an exception, which involves significant latency and complication, we prefer to use a modified procedure call in which the process privilege level is quietly raised to the required level. To provide this mechanism safely, interaction with the virtual memory system is required.

Such a procedure must not be entered from anywhere other than its legitimate entry point, to prohibit entering a procedure after the point at which security checks are performed or with invalid register contents; otherwise the access to a higher privilege level can lead to a security violation. In addition, the procedure generally must have access to memory data, for which addresses must be produced by the privileged code. To facilitate generating these addresses, the branch-gateway instruction allows the privileged code procedure to rely on the fact that a single register has been verified to contain a pointer to a valid memory region.

The branch-gateway instruction ensures both that the procedure is invoked at a proper entry point, and that other registers such as the data pointer and stack pointer can be properly set. To ensure this, the branch-gateway instruction retrieves a “gateway” directly from the protected virtual memory space. The gateway contains the virtual address of the entry point of the procedure and the target privilege level. A gateway can only exist in regions of the virtual address space designated to contain them, and can only be used to access privilege levels at or below the privilege level at which the memory region can be written, to ensure that a gateway cannot be forged.

The branch-gateway instruction ensures that register 1 (dp) contains a valid pointer to the gateway for this target code address by comparing the contents of register 0 (lp) against the gateway retrieved from memory and causing an exception trap if they do not match. By ensuring that register 1 points to the gateway, auxiliary information, such as the data pointer and stack pointer can be set by loading values located by the contents of register 1. For example, the eight bytes following the gateway may be used as a pointer to a data region for the procedure.

Referring to FIG. 54G, before executing the branch-gateway instruction, register 1 must be set to point at the gateway, and register 0 must be set to the address of the target code address plus the desired privilege level. A “L.I.64.L.A r0=r1,0” instruction is one way to set register 0, if register 1 has already been set, but any means of getting the correct value into register 0 is permissible.
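The check performed by the branch-gateway instruction may be sketched in Python as follows. The sketch is illustrative only and rests on several assumptions not stated above: memory is modelled as a dictionary of octlets keyed by address, the privilege level is assumed to occupy the two low-order bits of the gateway, a larger number is assumed to denote greater privilege, and the names GatewayTrap and branch_gateway are hypothetical. The checks that the gateway lies in a region designated to contain gateways are omitted.

```python
class GatewayTrap(Exception):
    """Stands in for the exception trap raised on a gateway mismatch."""

def branch_gateway(r0, r1, memory, current_privilege):
    """Return (target_address, new_privilege) or raise GatewayTrap.

    r1 points at the gateway; r0 must equal the gateway contents,
    i.e. the target code address plus the desired privilege level.
    """
    gateway = memory[r1]            # fetched from protected memory
    if gateway != r0:
        raise GatewayTrap("register 0 does not match the gateway")
    target = gateway & ~3           # assumption: low two bits hold privilege
    requested = gateway & 3
    # If the requested level is not higher than the current level,
    # the privilege is unchanged and the call behaves as a normal link.
    return target, max(current_privilege, requested)
```

After a successful check, the callee may rely on register 1 and, for example, load its data pointer from the octlet at r1 + 8.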

Similarly, a return from a system or privileged routine involves a reduction of privilege.

This need not be carefully controlled by architectural facilities, so a procedure may freely branch to a less-privileged code address. Normally, such a procedure restores the stack frame, then uses the branch-down instruction to return.

Typical Dynamic-Linked, Inter-Gateway Calling Sequence

caller:
caller:   A.ADDI    sp@-size      // allocate caller stack frame
          S.I.64.A  lp,sp,off
          S.I.64.A  dp,sp,off
          ...
          L.I.64.A  lp=dp,off     // load lp
          L.I.64.A  dp=dp,off     // load dp
          B.GATE
          L.I.64.A  dp,sp,off
          ... (code using dp)
          L.I.64.A  lp=sp,off     // restore original lp register
          A.ADDI    sp@size       // deallocate caller stack frame
          B         lp            // return

callee (non-leaf):
callee:   L.I.64.A  dp=dp,off     // load dp with data pointer
          S.I.64.A  sp,dp,off
          L.I.64.A  sp=dp,off     // new stack pointer
          S.I.64.A  lp,sp,off
          S.I.64.A  dp,sp,off
          ... (using dp)
          L.I.64.A  dp,sp,off
          ... (code using dp)
          L.I.64.A  lp=sp,off     // restore original lp register
          L.I.64.A  sp=sp,off     // restore original sp register
          B.DOWN    lp

callee (leaf, no stack):
callee:   ... (using dp)
          B.DOWN    lp

It can be observed that the calling sequence is identical to that of the inter-module calling sequence shown above, except for the use of the B.GATE instruction instead of a B.LINK instruction. Indeed, if a B.GATE instruction is used when the privilege level in the lp register is not higher than the current privilege level, the B.GATE instruction performs an identical function to a B.LINK.

The callee, if it uses a stack for local variable allocation, cannot necessarily trust the value of the sp passed to it, as it can be forged. Similarly, any pointers which the callee provides should not be used directly unless they are verified to point to regions which the callee should be permitted to address. This can be avoided by defining application programming interfaces (APIs) in which all values are passed and returned in registers, or by using a trusted, intermediate privilege wrapper routine to pass and return parameters. The method described below can also be used.

It can be useful to have highly privileged code call less-privileged routines. For example, a user may request that errors in a privileged routine be reported by invoking a user-supplied error-logging routine. To invoke the procedure, the privilege can be reduced via the branch-down instruction. The return from the procedure actually requires an increase in privilege, which must be carefully controlled. This is dealt with by placing the procedure call within a lower-privilege procedure wrapper, which uses the branch-gateway instruction to return to the higher privilege region after the call through a secure re-entry point. Special care must be taken to ensure that the less-privileged routine is not permitted to gain unauthorized access by corruption of the stack or saved registers, such as by saving all registers and setting up a new stack frame (or restoring the original lower-privilege stack) that may be manipulated by the less-privileged routine. Finally, such a technique is vulnerable to an unprivileged routine attempting to use the re-entry point directly, so it may be appropriate to keep a privileged state variable which controls permission to enter at the re-entry point.

Referring first to FIG. 1, a general purpose processor is illustrated therein in block diagram form. In FIG. 1, four copies of an access unit are shown, each with an access instruction fetch queue A-Queue 101-104. Each access instruction fetch queue A-Queue 101-104 is coupled to an access register file AR 105-108, which are each coupled to two access functional units A 109-116. In a typical embodiment, each thread of the processor may have on the order of sixty-four general purpose registers (e.g., the AR's 105-108 and ER's 125-128). The access units function independently for four simultaneous threads of execution, and each compute program control flow by performing arithmetic and branch instructions and access memory by performing load and store instructions. These access units also provide wide operand specifiers for wide operand instructions. These eight access functional units A 109-116 produce results for access register files AR 105-108 and memory addresses to a shared memory system 117-120.

In one embodiment, the memory hierarchy includes on-chip instruction and data memories, instruction and data caches, a virtual memory facility, and interfaces to external devices. In FIG. 1, the memory system is comprised of a combined cache and niche memory 117, an external bus interface 118, and, externally to the device, a secondary cache 119 and main memory system with I/O devices 120. The memory contents fetched from memory system 117-120 are combined with execute instructions not performed by the access unit, and entered into the four execute instruction queues E-Queue 121-124. In accordance with one embodiment of the present invention, from the software perspective, the machine state includes a linear byte-addressed shared memory space. For wide instructions, memory contents fetched from memory system 117-120 are also provided to wide operand microcaches 132-136 by bus 137. Instructions and memory data from E-Queue 121-124 are presented to execution register files 125-128, which fetch execution register file source operands. The instructions are coupled to the execution unit arbitration unit Arbitration 131, which selects which instructions from the four threads are to be routed to the available execution functional units E 141 and 149, X 142 and 148, G 143-144 and 146-147, and T 145. The execution functional units E 141 and 149, the execution functional units X 142 and 148, and the execution functional unit T 145 each contain a wide operand microcache 132-136, which are each coupled to the memory system 117 by bus 137.

The execution functional units G 143-144 and 146-147 are group arithmetic and logical units that perform simple arithmetic and logical instructions, including group operations wherein the source and result operands represent a group of values of a specified symbol size, which are partitioned and operated on separately, with results catenated together. In a presently preferred embodiment the data path is 128 bits wide, although the present invention is not intended to be limited to any specific size of data path.
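The partition, operate, and catenate pattern of a group operation may be sketched in Python using a group add as the example. The function name group_add is hypothetical, the 128-bit operands are modelled as non-negative integers, and overflow within a field is assumed to wrap modulo the symbol size.

```python
def group_add(a, b, size, width=128):
    """Add corresponding size-bit symbols of a and b independently and
    catenate the results; carries never cross a symbol boundary."""
    mask = (1 << size) - 1
    r = 0
    for k in range(width // size):
        x = (a >> (k * size)) & mask     # partition
        y = (b >> (k * size)) & mask
        r |= ((x + y) & mask) << (k * size)   # operate and catenate
    return r
```

The same operand bits give different results for different symbol sizes: group_add(0x00FF0001, 0x00010001, 16) is 0x01000002, whereas with a symbol size of 8 the carry out of the byte 0xFF is discarded and the result is 0x00000002.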

The execution functional units X 142 and 148 are crossbar switch units that perform crossbar switch instructions. The crossbar switch units 142 and 148 perform data handling operations on the data stream provided over the data path source operand buses 151-158, including deal, shuffles, shifts, expands, compresses, swizzles, permutes and reverses, plus the wide operations discussed hereinafter. In a key element of a first aspect of the invention, at least one such operation will be expanded to a width greater than the general register and data path width. Examples of the data manipulation operations are described in another section.

The execution functional units E 141 and 149 are ensemble units that perform ensemble instructions using a large array multiplier, including group or vector multiply and matrix multiply of operands partitioned from data path source operand buses 151-158 and treated as integer, floating-point, polynomial or Galois field values. According to the present embodiment of the invention, a general software solution is provided to the most common operations required for Galois Field arithmetic. The instructions provided include a polynomial multiply, with the polynomial specified as one register operand. This instruction can be used to perform CRC generation and checking, Reed-Solomon code generation and checking, and spread-spectrum encoding and decoding. Also, matrix multiply instructions and other operations described in another section utilize a wide operand loaded into the wide operand microcaches 132 and 136.

The execution functional unit T 145 is a translate unit that performs table-look-up operations on a group of operands partitioned from a register operand, and catenates the result. The Wide Translate instruction included in another section utilizes a wide operand loaded into the wide operand microcache 134.

The execution functional units E 141, 149, execution functional units X 142, 148, and execution functional unit T 145 each contain dedicated storage to permit storage of source operands including wide operands as discussed hereinafter. The dedicated storage 132-136, which may be thought of as a wide microcache, typically has a width which is a multiple of the width of the data path operands related to the data path source operand buses 151-158. Thus, if the width of the data path 151-158 is 128 bits, the dedicated storage 132-136 may have a width of 256, 512, 1024 or 2048 bits. Operands which utilize the full width of the dedicated storage are referred to herein as wide operands, although it is not necessary in all instances that a wide operand use the entirety of the width of the dedicated storage; it is sufficient that the wide operand use a portion greater than the width of the memory data path of the output of the memory system 117-120 and the functional unit data path of the input of the execution functional units 141-149, though not necessarily greater than the width of the two combined. Because the width of the dedicated storage 132-136 is greater than the width of the memory operand bus 137, portions of wide operands are loaded sequentially into the dedicated storage 132-136. However, once loaded, the wide operands may then be used at substantially the same time. It can be seen that functional units 141-149 and associated execution registers 125-128 form a data functional unit, the exact elements of which may vary with implementation.

The execution register file ER 125-128 source operands are coupled to the execution units 141-145 using source operand buses 151-154 and to the execution units 145-149 using source operand buses 155-158. The function unit result operands from execution units 141-145 are coupled to the execution register file ER 125-128 using result bus 161 and the function unit result operands from execution units 145-149 are coupled to the execution register file using result bus 162.

The wide operands used in some embodiments of the present invention provide the ability to execute complex instructions such as the wide multiply matrix instruction shown in FIG. 2, which can be appreciated in an alternative form, as well, from FIG. 3. As can be appreciated from FIGS. 2 and 3, a wide operand permits, for example, the matrix multiplication of various sizes and shapes which exceed the data path width. The example of FIG. 2 involves a matrix specified by register rc having a 128*64/size multiplied by a vector contained in register rb having a 128 size, to yield a result, placed in register rd, of 128 bits.

The operands that are substantially larger than the data path width of the processor are provided by using a general-purpose register to specify a memory specifier from which more than one, but in some embodiments several, data path widths of data can be read into the dedicated storage. The memory specifier typically includes the memory address together with the size and shape of the matrix of data being operated on. The memory specifier or wide operand specifier can be better appreciated from FIG. 5, in which a specifier 500 is seen to be an address, plus a field representative of the size/2 and a further field representative of width/2, where size is the product of the depth and width of the data. The address is aligned to a specified size, for example sixty-four bytes, so that a plurality of low order bits (for example, six bits) are zero. The specifier 500 can thus be seen to comprise a first field 505 for the address, plus two field indicia 510 within the low order six bits to indicate size and width.

The decoding of the specifier 500 may be further appreciated from FIG. 6, which shows a given specifier 600 made up of an address field 605 together with a field 610 comprising a plurality of low order bits. By a series of arithmetic operations shown at steps 615 and 620, the portion of the field 610 representative of width/2 is developed. In a similar series of steps shown at 625 and 630, the value of t is decoded, which can then be used to decode both size and address. The portion of the field 610 representative of size/2 is decoded as shown at steps 635 and 640, while the address is decoded in a similar way at steps 645 and 650.
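The packing and decoding of such a specifier can be sketched as follows. This is a minimal illustration rather than the circuit of FIG. 6: the function names are hypothetical, sizes are expressed in bytes, and the sketch assumes that size and width are powers of two with 2 <= width < size and that the address is aligned to size.

```python
def encode_specifier(address, size, width):
    """Pack address, size/2 and width/2 into one value (hypothetical helper)."""
    for name, value in (("size", size), ("width", width)):
        if value < 2 or value & (value - 1):
            raise ValueError(f"{name} must be a power of two >= 2")
    if width >= size:
        raise ValueError("this sketch assumes width < size")
    if address % size:
        raise ValueError("address must be aligned to size")
    # The aligned address has zeros in its low-order bits, so the two
    # half-size fields can be OR'ed in without disturbing it.
    return address | (size // 2) | (width // 2)


def decode_specifier(spec):
    """Recover (address, size, width) by peeling off the two lowest set bits."""
    half_width = spec & -spec      # lowest set bit represents width/2
    t = spec & (spec - 1)          # clear it; t now holds address | size/2
    half_size = t & -t             # next lowest set bit represents size/2
    address = t & (t - 1)          # whatever remains is the aligned address
    return address, 2 * half_size, 2 * half_width
```

Under these assumptions, a 64-byte operand that is 8 bytes wide at address 0x1000 encodes to 0x1024, and decoding that value returns the original three quantities from the single register value.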

The wide function unit may be better appreciated from FIG. 7, in which a register number 700 is provided to an operand checker 705. Wide operand specifier 710 communicates with the operand checker 705 and also addresses memory 715 having a defined memory width. The memory address includes a plurality of register operands 720A-n, which are accumulated in a dedicated storage portion 714 of a data functional unit 725. In the exemplary embodiment shown in FIG. 7, the dedicated storage 714 can be seen to have a width equal to eight data path widths, such that eight wide operand portions 730A-H are sequentially loaded into the dedicated storage to form the wide operand. Although eight portions are shown in FIG. 7, the present invention is not limited to eight or any other specific multiple of data path widths. Once the wide operand portions 730A-H are sequentially loaded, they may be used as a single wide operand 735 by the functional element 740, which may be any element(s) from FIG. 1 connected thereto. The result of the wide operand is then provided to a result register 745, which in a presently preferred embodiment is of the same width as the memory width.

Once the wide operand is successfully loaded into the dedicated storage 714, a second aspect of the present invention may be appreciated. Further execution of this instruction or other similar instructions that specify the same memory address can read the dedicated storage to obtain the operand value under specific conditions that determine whether the memory operand has been altered by intervening instructions. Assuming that these conditions are met, the memory operand fetched from the dedicated storage is combined with one or more register operands in the functional unit, producing a result. In some embodiments, the size of the result is limited to that of a general register, so that no similar dedicated storage is required for the result. However, in some different embodiments, the result may be a wide operand, to further enhance performance.

To permit the wide operand value to be addressed by subsequent instructions specifying the same memory address, various conditions must be checked and confirmed:

Those conditions include:

1. Each memory store instruction checks the memory address against the memory addresses recorded for the dedicated storage. Any match causes the storage to be marked invalid, since a memory store instruction directed to any of the memory addresses stored in dedicated storage 714 means that data has been overwritten.

2. The register number used to address the storage is recorded. If no intervening instructions have written to the register, and the same register is used on the subsequent instruction, the storage is valid (unless marked invalid by rule #1).

3. If the register has been modified or a different register number is used, the value of the register is read and compared against the address recorded for the dedicated storage. This uses more resources than #1 because of the need to fetch the register contents and because the width of the register is greater than that of the register number itself. If the address matches, the storage is valid. The new register number is recorded for the dedicated storage.

4. If conditions #2 or #3 are not met, the register contents are used to address the general-purpose processor's memory and load the dedicated storage. If dedicated storage is already fully loaded, a portion of the dedicated storage must be discarded (victimized) to make room for the new value. The instruction is then performed using the newly updated dedicated storage. The address and register number are recorded for the dedicated storage.
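The checking conditions above can be illustrated with a toy software model. The class and method names below are hypothetical, a bytearray stands in for general memory, and the model holds a single wide operand, so the victimization of a partially filled storage is not shown.

```python
class DedicatedStorage:
    """Toy model of one wide-operand store (all names are hypothetical)."""

    def __init__(self, memory, operand_bytes):
        self.memory = memory              # bytearray standing in for general memory
        self.operand_bytes = operand_bytes
        self.valid = False
        self.address = None               # memory address of the cached operand
        self.reg = None                   # register number last used to address it
        self.reg_clean = False            # True while that register is unmodified
        self.contents = b""
        self.loads = 0                    # number of loads from general memory

    def on_store(self, address, data):
        """Rule 1: a store that overlaps the cached operand marks it invalid."""
        self.memory[address:address + len(data)] = data
        if (self.valid and address < self.address + self.operand_bytes
                and self.address < address + len(data)):
            self.valid = False

    def on_register_write(self, reg):
        """Note writes to the recorded register so the quick check can fail."""
        if reg == self.reg:
            self.reg_clean = False

    def read_operand(self, reg, registers):
        # Quick check: same register number, not written since last use.
        if self.valid and reg == self.reg and self.reg_clean:
            return self.contents
        # Otherwise read the register and compare it with the recorded address.
        address = registers[reg]
        if not (self.valid and address == self.address):
            # No match: load the dedicated storage from general memory.
            end = address + self.operand_bytes
            self.contents = bytes(self.memory[address:end])
            self.address = address
            self.valid = True
            self.loads += 1
        self.reg, self.reg_clean = reg, True
        return self.contents
```

In this model, repeated use through the same unmodified register never touches the register file or memory, a different register holding the same address costs one comparison, and only a store to the operand or a genuinely new address causes a reload.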

By checking the above conditions, the need for saving and restoring the dedicated storage is eliminated. In addition, if the context of the processor is changed and the new context does not employ Wide instructions that reference the same dedicated storage, when the original context is restored, the contents of the dedicated storage are allowed to be used without refreshing the value from memory, using checking rule #3. Because the values in the dedicated storage are read from memory and not modified directly by performing wide operations, the values can be discarded at any time without saving the results into general memory. This property simplifies the implementation of rule #4 above.

An alternate embodiment of the present invention can replace rule #1 above with the following rule:

1a. Each memory store instruction checks the memory address against the memory addresses recorded for the dedicated storage. Any match causes the dedicated storage to be updated, as well as the general memory.

By use of the above rule 1a, memory store instructions can modify the dedicated storage, updating just the piece of the dedicated storage that has been changed, leaving the remainder intact. By continuing to update the general memory, it is still true that the contents of the dedicated memory can be discarded at any time without saving the results into general memory. Thus rule #4 is not made more complicated by this choice. The advantage of this alternate embodiment is that the dedicated storage need not be discarded (invalidated) by memory store operations.

Referring next to FIG. 9, an exemplary arrangement of the data structures of the wide microcache or dedicated storage 114 may be better appreciated. The wide microcache contents, wmc.c, can be seen to form a plurality of data path widths 900A-n, although in the example shown the number is eight. The physical address, wmc.pa, is shown as 64 bits in the example shown, although the invention is not limited to a specific width. The size of the contents, wmc.size, is also provided in a field which is shown as 10 bits in an exemplary embodiment. A "contents valid" flag, wmc.cv, of one bit is also included in the data structure, together with a two bit field for thread last used, or wmc.th. In addition, a six bit field for register last used, wmc.reg, is provided in an exemplary embodiment. Further, a one bit flag for register and thread valid, or wmc.rtv, may be provided.

The process by which the microcache is initially written with a wide operand, and thereafter verified as valid for fast subsequent operations, may be better appreciated from FIG. 8. The process begins at 800, and progresses to step 805 where a check of the register contents is made against the stored value wmc.rc. If true, a check is made at step 810 to verify the thread. If true, the process then advances to step 815 to verify whether the register and thread are valid. If step 815 reports as true, a check is made at step 820 to verify whether the contents are valid. If all of steps 805 through 820 return as true, the subsequent instruction is able to utilize the existing wide operand as shown at step 825, after which the process ends. However, if any of steps 805 through 820 return as false, the process branches to step 830, where content, physical address and size are set. Because steps 805 through 820 all lead to either step 825 or 830, steps 805 through 820 may be performed in any order or simultaneously without altering the process.

The process then advances to step 835 where size is checked. This check basically ensures that the size of the translation unit is greater than or equal to the size of the wide operand, so that a physical address can directly replace the use of a virtual address. The concern is that, in some embodiments, the wide operands may be larger than the minimum region that the virtual memory system is capable of mapping. As a result, it would be possible for a single contiguous virtual address range to be mapped into multiple, disjoint physical address ranges, complicating the task of comparing physical addresses. The size of the wide operand is determined and compared against the size of the virtual address mapping region which is referenced, and the instruction is aborted with an exception trap if the wide operand is larger than the mapping region. This ensures secure operation of the processor. Software can then re-map the region using a larger size map to continue execution if desired. Thus, if size is reported as unacceptable at step 835, an exception is generated at step 840.

If size is acceptable, the process advances to step 845 where physical address is checked. If the check reports as met, the process advances to step 850, where a check of the contents valid flag is made. If either check at step 845 or 850 reports as false, the process branches to step 855 and new content is written into the dedicated storage 114, with the fields thereof being set accordingly. Whether the check at step 850 reported true, or whether new content was written at step 855, the process advances to step 860 where appropriate fields are set to indicate the validity of the data, after which the requested function can be performed at step 825. The process then ends.
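The flow of FIG. 8 can be approximated in software as shown below. This is a sketch under stated assumptions, not the hardware of FIGS. 10 and 11: the record mirrors the wmc fields named above, the register check of step 805 is modeled as a comparison of register numbers (an interpretation), and the names access, translate, read_memory and WideOperandTooLarge are hypothetical.

```python
from dataclasses import dataclass


class WideOperandTooLarge(Exception):
    """Stands in for the exception trap of step 840 (hypothetical name)."""


@dataclass
class WideMicrocache:
    c: bytes = b""      # contents
    pa: int = 0         # physical address
    size: int = 0       # size of contents, in bytes here
    cv: bool = False    # contents valid
    th: int = 0         # thread last used
    reg: int = 0        # register last used
    rtv: bool = False   # register and thread valid


def access(wmc, reg, thread, virtual_address, size, translate, read_memory):
    """Return (operand, status); status is 'hit', 'revalidated' or 'loaded'."""
    # Steps 805 to 820: all four checks must pass to reuse the operand.
    # The sketch assumes rtv is cleared elsewhere when the register is written.
    if wmc.reg == reg and wmc.th == thread and wmc.rtv and wmc.cv:
        return wmc.c, "hit"                                   # step 825
    # Step 830: obtain the physical address; translate() is assumed to
    # return (physical_address, size_of_mapping_region).
    physical_address, region_size = translate(virtual_address)
    if size > region_size:                                    # step 835
        raise WideOperandTooLarge(size)                       # step 840
    # Steps 845 and 850: same physical operand and contents still valid?
    if wmc.pa == physical_address and wmc.size == size and wmc.cv:
        status = "revalidated"
    else:
        wmc.c = read_memory(physical_address, size)           # step 855
        wmc.pa, wmc.size, wmc.cv = physical_address, size, True
        status = "loaded"
    wmc.reg, wmc.th, wmc.rtv = reg, thread, True              # step 860
    return wmc.c, status
```

The three status values correspond to the three ways of reaching step 825: directly from the quick checks, after the physical address comparison succeeds, or after new content is written.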

Referring next to FIGS. 10 and 11, which together show the operation of the microcache controller from a hardware standpoint, the operation of the microcache controller may be better understood. In the hardware implementation, it is clear that conditions which are indicated as sequential steps in FIGS. 8 and 9 above can be performed in parallel, reducing the delay for such wide operand checking. Further, a copy of the indicated hardware may be included for each wide microcache, and thereby all such microcaches as may be alternatively referenced by an instruction can be tested in parallel. It is believed that no further discussion of FIGS. 10 and 11 is required in view of the extensive discussion of FIGS. 8 and 9, above.

Various alternatives to the foregoing approach do exist for the use of wide operands, including an implementation in which a single instruction can accept two wide operands, partition the operands into symbols, multiply corresponding symbols together, and add the products to produce a single scalar value or a vector of partitioned values of width of the register file, possibly after extraction of a portion of the sums. Such an instruction can be valuable for detection of motion or estimation of motion in video compression. A further enhancement of such an instruction can incrementally update the dedicated storage if the address of one wide operand is within the range of previously specified wide operands in the dedicated storage, by loading only the portion not already within the range and shifting the in-range portion as required. Such an enhancement allows the operation to be performed over a "sliding window" of possible values. In such an instruction, one wide operand is aligned and supplies the size and shape information, while the second wide operand, updated incrementally, is not aligned.

Another alternative embodiment of the present invention can define additional instructions where the result operand is a wide operand. Such an enhancement removes the limit that a result can be no larger than the size of a general register, further enhancing performance. These wide results can be cached locally to the functional unit that created them, but must be copied to the general memory system before the storage can be reused and before the virtual memory system alters the mapping of the address of the wide result. Data paths must be added so that load operations and other wide operations can read these wide results. Forwarding of a wide result from the output of a functional unit back to its input is relatively easy, but additional data paths may have to be introduced if it is desired to forward wide results back to other functional units as wide operands.

As previously discussed, a specification of the size and shape of the memory operand is included in the low-order bits of the address. In a presently preferred implementation, such memory operands are typically a power of two in size and aligned to that size. Generally, one-half the total size is added (or inclusively or'ed, or exclusively or'ed) to the memory address, and one half of the data width is added (or inclusively or'ed, or exclusively or'ed) to the memory address. These bits can be decoded and stripped from the memory address, so that the controller is made to step through all the required addresses. This decreases the number of distinct operands required for these instructions, as the size, shape and address of the memory operand are combined into a single register operand value.

Particular examples of wide operations which are defined by the present invention include the Wide Switch instruction that performs bit-level switching; the Wide Translate instruction which performs byte (or larger) table-lookup; Wide Multiply Matrix, Wide Multiply Matrix Extract and Wide Multiply Matrix Extract Immediate (discussed below), Wide Multiply Matrix Floating-point, and Wide Multiply Matrix Galois (also discussed below). While the discussion below focuses on particular sizes for the exemplary instructions, it will be appreciated that the invention is not limited to a particular width.

The Wide Switch instruction rearranges the contents of up to two registers (256 bits) at the bit level, producing a full-width (128 bits) register result. To control the rearrangement, a wide operand specified by a single register, consisting of eight bits per bit position, is used. For each result bit position, the eight wide operand bits for that bit position select which of the 256 possible source register bits to place in the result. When a wide operand size smaller than 128 bytes is specified, the high order bits of the memory operand are replaced with values corresponding to the result bit position, so that the memory operand specifies a bit selection within symbols of the operand size, performing the same operation on each symbol.
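The full-size case of this bit-level selection can be sketched as follows. The function name, the use of Python integers for register values, and the bit numbering (bit 0 is the least significant bit, with the second register occupying source bits 128 to 255) are illustrative assumptions; the reduced-size forms are not modeled.

```python
def wide_switch(source, selectors):
    """Select 128 result bits from a 256-bit source (hypothetical helper).

    source    -- integer holding the two catenated source registers
    selectors -- 128 integers in 0..255, one eight-bit selector per
                 result bit position, starting with result bit 0
    """
    if len(selectors) != 128:
        raise ValueError("one selector is needed per result bit")
    result = 0
    for position, select in enumerate(selectors):
        if not 0 <= select <= 255:
            raise ValueError("each selector is an eight-bit value")
        bit = (source >> select) & 1     # pick one of the 256 source bits
        result |= bit << position        # place it at this result position
    return result
```

With selectors 0 through 127 the low register passes through unchanged, with selectors 128 through 255 the high register is selected, and a descending sequence reverses the bit order. Any permutation, replication or selection of source bits is expressed the same way.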

The Wide Translate instructions use a wide operand to specify a table of depth up to 256 entries and width of up to 128 bits. The contents of a register are partitioned into operands of one, two, four, or eight bytes, and the partitions are used to select values from the table in parallel. The depth and width of the table can be selected by specifying the size and shape of the wide operand as described above.
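A software analogue of the parallel table look-up is sketched below. The function name is hypothetical, and the sketch assumes that each table entry is as wide as one partition and that a partition value is reduced modulo the table depth to form the index; the actual indexing rule of the instruction may differ.

```python
def wide_translate(register, table, symbol_bytes=1, register_bytes=16):
    """Replace each partition of `register` with a table entry (a sketch).

    register -- integer holding the register contents
    table    -- list of up to 256 integer entries
    """
    if symbol_bytes not in (1, 2, 4, 8):
        raise ValueError("partitions are one, two, four, or eight bytes")
    symbol_bits = 8 * symbol_bytes
    mask = (1 << symbol_bits) - 1
    result = 0
    for i in range(register_bytes // symbol_bytes):
        symbol = (register >> (i * symbol_bits)) & mask
        # Assumption: index by the partition value modulo the table depth.
        entry = table[symbol % len(table)] & mask
        result |= entry << (i * symbol_bits)
    return result
```

For example, a 256-entry table whose entry v holds 255 - v complements every byte of the register in a single operation.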

The Wide Multiply Matrix instructions use a wide operand to specify a matrix of values of width up to 64 bits (one half of register file and data path width) and depth of up to 128 bits/symbol size. The contents of a general register (128 bits) are used as a source operand, partitioned into a vector of symbols, and multiplied with the matrix, producing a vector of width up to 128 bits of symbols of twice the size of the source operand symbols. The width and depth of the matrix can be selected by specifying the size and shape of the wide operand as described above. Controls within the instruction allow specification of signed, mixed-signed, unsigned, complex, or polynomial operands.
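The shape of this operation, for the unsigned case only, might be sketched as below. The function name and the list-based representation are hypothetical, and wrapping each sum to twice the symbol size is an assumption made for the sketch.

```python
def wide_multiply_matrix(vector, matrix, size):
    """Unsigned vector-by-matrix multiply on `size`-bit symbols (a sketch).

    vector -- 128 // size symbols taken from the general register
    matrix -- 128 // size rows, each holding 64 // size symbols
    Returns 64 // size results of 2 * size bits each (128 bits in total).
    """
    depth, width = 128 // size, 64 // size
    if len(vector) != depth or len(matrix) != depth:
        raise ValueError("vector and matrix depth must be 128 // size")
    if any(len(row) != width for row in matrix):
        raise ValueError("each matrix row must hold 64 // size symbols")
    mask = (1 << (2 * size)) - 1     # assumption: sums wrap to 2 * size bits
    return [
        sum(vector[i] * matrix[i][j] for i in range(depth)) & mask
        for j in range(width)
    ]
```

With 16-bit symbols the register supplies eight symbols, the matrix is eight rows by four columns, and the result is four 32-bit sums of products that together fill one 128-bit register.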

The Wide Multiply Matrix Extract instructions use a wide operand to specify a matrix of values of width up to 128 bits (full width of register file and data path) and depth of up to 128 bits/symbol size. The contents of a general register (128 bits) are used as a source operand, partitioned into a vector of symbols, and multiplied with the matrix, producing a vector of width up to 256 bits of symbols of twice the size of the source operand symbols plus additional bits to represent the sums of products without overflow. The results are then extracted in a manner described below (Enhanced Multiply Bandwidth by Result Extraction), as controlled by the contents of a general register specified by the instruction. The general register also specifies the format of the operands: signed, mixed-signed, unsigned, and complex, as well as the size of the operands: byte (8 bit), doublet (16 bit), quadlet (32 bit), or hexlet (64 bit).

The Wide Multiply Matrix Extract Immediate instructions perform the same function as above, except that the extraction, operand format and size are controlled by fields in the instruction. This form encodes common forms of the above instruction without the need to initialize a register with the required control information. Controls within the instruction allow specification of signed, mixed-signed, unsigned, and complex operands.

The Wide Multiply Matrix Floating-point instructions perform a matrix multiply in the same form as above, except that the multiplies and additions are performed in floating-point arithmetic. Sizes of half (16-bit), single (32-bit), double (64-bit), and complex sizes of half, single and double can be specified within the instruction.

Wide Multiply Matrix Galois instructions perform a matrix multiply in the same form as above, except that the multiplies and additions are performed in Galois field arithmetic. A size of 8 bits can be specified within the instruction. The contents of a general register specify the polynomial with which to perform the Galois field remainder operation. The nature of the matrix multiplication is novel and described in detail below.

In another aspect of the invention, memory operands of either little-endian or big-endian conventional byte ordering are facilitated. Consequently, all Wide operand instructions are specified in two forms, one for little-endian byte ordering and one for big-endian byte ordering, as specified by a portion of the instruction. The byte order specifies to the memory system the order in which to deliver the bytes within units of the data path width (128 bits), as well as the order to place multiple memory words (128 bits) within a larger Wide operand. Each of these instructions is described in greater detail.
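The two roles of the byte order, within a 128-bit memory word and across the words of a larger operand, can be made concrete with a short sketch. The function name is hypothetical, and the sketch assumes that the first word in memory is the least significant for little-endian ordering and the most significant for big-endian ordering.

```python
WORD_BYTES = 16  # one data path width of 128 bits


def load_wide_operand(memory, address, size_bytes, byte_order):
    """Assemble a wide operand from memory words (hypothetical helper).

    byte_order is 'little' or 'big' and governs both the order of bytes
    inside each 128-bit word and the order of words in the operand.
    """
    if byte_order not in ("little", "big"):
        raise ValueError("byte_order must be 'little' or 'big'")
    words = []
    for offset in range(0, size_bytes, WORD_BYTES):
        start = address + offset
        chunk = memory[start:start + WORD_BYTES]
        words.append(int.from_bytes(chunk, byte_order))   # bytes within a word
    if byte_order == "big":
        words.reverse()        # first word in memory becomes most significant
    operand = 0
    for index, word in enumerate(words):                  # words[0] is lowest
        operand |= word << (8 * WORD_BYTES * index)
    return operand
```

Under these assumptions the result equals the value obtained by interpreting the whole operand as one integer in the chosen byte order, which is what allows the same memory image to be used consistently by either form of an instruction.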

Some embodiments of the present invention address extraction of a high order portion of a multiplier product or sum of products, as a way of efficiently utilizing a large multiplier array. Parent U.S. Pat. No. 5,742,840 and U.S. Pat. No. 5,953,241 describe a system and method for enhancing the utilization of a multiplier array by adding specific classes of instructions to a general-purpose processor. This addresses the problem of making the most use of a large multiplier array that is fully used for high-precision arithmetic (for example, a 64×64 bit multiplier is fully used by a 64-bit by 64-bit multiply, but only one quarter used for a 32-bit by 32-bit multiply) when performing arithmetic operations that are low-precision relative to the multiplier data width and registers. In particular, operations that perform a great many low-precision multiplies which are combined (added) together in various ways are specified. One of the overriding considerations in selecting the set of operations is a limitation on the size of the result operand. In an exemplary embodiment, for example, this size might be limited to on the order of 128 bits, or a single register, although no specific size limitation need exist.

The size of a multiply result, a product, is generally the sum of the sizes of the operands, multiplicands and multiplier. Consequently, multiply instructions specify operations in which the size of the result is twice the size of identically-sized input operands. In our prior art design, for example, a multiply instruction accepted two 64-bit register sources and produced a single 128-bit register-pair result, using an entire 64×64 multiplier array for 64-bit symbols, or half the multiplier array for pairs of 32-bit symbols, or one-quarter the multiplier array for quads of 16-bit symbols. For all of these cases, note that two register sources of 64 bits are combined, yielding a 128-bit result.

In several of the operations, including complex multiplies, convolve, and matrix multiplication, low-precision multiplier products are added together. The additions further increase the required precision. The sum of two products requires one additional bit of precision; adding four products requires two, adding eight products requires three, adding sixteen products requires four. In some prior designs, some of this precision is lost, requiring scaling of the multiplier operands to avoid overflow, further reducing accuracy of the result.
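The growth in precision described above follows a simple rule, which the following sketch states for unsigned operands. The function name is hypothetical.

```python
def sum_of_products_bits(symbol_bits, products):
    """Bits needed for the sum of `products` products of two
    symbol_bits-wide unsigned operands (hypothetical helper)."""
    if products < 1:
        raise ValueError("at least one product is required")
    # A product needs twice the operand size; each doubling of the number
    # of products added together needs one further bit.
    extra = (products - 1).bit_length()   # ceil(log2(products))
    return 2 * symbol_bits + extra
```

For sixteen products of 16-bit symbols this gives 36 bits, four more than the 32 bits of a single product, which is the excess that extraction is later used to trim.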

The use of register pairs creates an undesirable complexity, in that both the register pair and individual register values must be bypassed to subsequent instructions. As a result, with prior art techniques only half of the source operand 128-bit register values could be employed toward producing a single-register 128-bit result.

In some embodiments of the present invention, a high-order portion of the multiplier product or sum of products is extracted, adjusted by a dynamic shift amount from a general register or an adjustment specified as part of the instruction, and rounded by a control value from a register or instruction portion as round-to-nearest/even, toward zero, floor, or ceiling. Overflows are handled by limiting the result to the largest and smallest values that can be accurately represented in the output result.
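The extraction step, consisting of a shift, a rounding choice, and limiting, can be sketched for signed values as follows. The function name, the rounding-mode strings, and the restriction to signed two's complement results are assumptions of the sketch.

```python
def extract(value, shift, result_bits, rounding="nearest"):
    """Shift `value` right by `shift` bits with rounding, then limit the
    result to a signed `result_bits`-bit range (hypothetical helper)."""
    quotient, remainder = divmod(value, 1 << shift)   # quotient is the floor
    half = (1 << shift) >> 1
    if rounding == "floor":
        pass
    elif rounding == "ceiling":
        if remainder != 0:
            quotient += 1
    elif rounding == "zero":
        if value < 0 and remainder != 0:
            quotient += 1                             # move back toward zero
    elif rounding == "nearest":
        # Round to nearest, with ties going to the even neighbor.
        tie = shift > 0 and remainder == half
        if remainder > half or (tie and quotient & 1):
            quotient += 1
    else:
        raise ValueError("unknown rounding mode")
    low = -(1 << (result_bits - 1))
    high = (1 << (result_bits - 1)) - 1
    return max(low, min(high, quotient))              # limit on overflow
```

For example, extracting from the value 5 with a shift of one bit yields 2 with nearest/even or floor rounding and 3 with ceiling rounding, while a value too large for the requested result size is limited to the largest representable value rather than wrapping.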

In the present invention, when the extract is controlled by a register, the size of the result can be specified, allowing rounding and limiting to a smaller number of bits than can fit in the result. This permits the result to be scaled to be used in subsequent operations without concern of overflow or rounding, enhancing performance.

Also in the present invention, when the extract is controlled by a register, a single register value defines the size of the operands, the shift amount and size of the result, and the rounding control. By placing all this control information in a single register, the size of the instruction is reduced over the number of bits that such an instruction would otherwise require, improving performance and enhancing flexibility of the processor.

The particular instructions included in this aspect of the present invention are Ensemble Convolve Extract, Ensemble Multiply Extract, Ensemble Multiply Add Extract and Ensemble Scale Add Extract, each of which is more thoroughly treated in another section.

An aspect of the present invention defines the Ensemble Scale Add Extract instruction, which combines the extract control information in a register along with two values that are used as scalar multipliers to the contents of two vector multiplicands. This combination reduces the number of registers that would otherwise be required, or the number of bits that the instruction would otherwise require, improving performance.

Several of these instructions (Ensemble Convolve Extract, Ensemble Multiply Add Extract) are typically available only in forms where the extract is specified as part of the instruction. An alternative embodiment can incorporate forms of the operations in which the size of the operand, the shift amount and the rounding can be controlled by the contents of a general register (as they are in the Ensemble Multiply Extract instruction). The definition of this kind of instruction for Ensemble Convolve Extract and Ensemble Multiply Add Extract would require four source registers, which increases complexity by requiring additional general-register read ports.

Another alternative embodiment can reduce the number of register read ports required for implementation of instructions in which the size, shift and rounding of operands is controlled by a register. The value of the extract control register can be fetched using an additional cycle on an initial execution and retained within or near the functional unit for subsequent executions, thus reducing the amount of hardware required for implementation with a small additional performance penalty. The value retained would be marked invalid, causing a re-fetch of the extract control register, by instructions that modify the register, or alternatively, the retained value can be updated by such an operation. A re-fetch of the extract control register would also be required if a different register number were specified on a subsequent execution. It should be clear that the properties of the above two alternative embodiments can be combined.

Another embodiment of the invention includes Galois field arithmetic, where multiplies are performed by an initial binary polynomial multiplication (unsigned binary multiplication with carries suppressed), followed by a polynomial modulo/remainder operation (unsigned binary division with carries suppressed). The remainder operation is relatively expensive in area and delay. In Galois field arithmetic, additions are performed by binary addition with carries suppressed, or equivalently, a bitwise exclusive-or operation. In this aspect of the present invention, a matrix multiplication is performed using Galois field arithmetic, where the multiplies and additions are Galois field multiplies and additions.

Using prior art methods, a 16 byte vector multiplied by a 16×16 byte matrix can be performed as 256 8-bit Galois field multiplies and 16*15=240 8-bit Galois field additions. Included in the 256 Galois field multiplies are 256 polynomial multiplies and 256 polynomial remainder operations. But by use of the present invention, the total computation can be reduced significantly by performing 256 polynomial multiplies, 240 16-bit polynomial additions, and 16 polynomial remainder operations. Note that the cost of the polynomial additions has been doubled, as these are now 16-bit operations, but the cost of the polynomial remainder functions has been reduced by a factor of 16. Overall, this is a favorable tradeoff, as the cost of addition is much lower than the cost of remainder.
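The equivalence of the two orderings rests on the fact that the polynomial remainder is linear over exclusive-or, so the remainder of a sum equals the sum of the remainders. A minimal sketch follows; the function names are hypothetical, and the polynomial used in the usage note (0x11B) is only an example, since the text leaves the polynomial to a register value.

```python
def poly_mul(a, b):
    """Binary polynomial multiply: multiplication with carries suppressed."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        b >>= 1
    return product


def poly_rem(value, polynomial):
    """Binary polynomial remainder: division with carries suppressed."""
    degree = polynomial.bit_length() - 1
    while value.bit_length() > degree:
        value ^= polynomial << (value.bit_length() - 1 - degree)
    return value


def matvec_reduce_each(vector, matrix, polynomial):
    """One remainder per multiply, as in the prior art ordering."""
    results = []
    for j in range(len(matrix[0])):
        acc = 0
        for i, v in enumerate(vector):
            acc ^= poly_rem(poly_mul(v, matrix[i][j]), polynomial)
        results.append(acc)
    return results


def matvec_reduce_once(vector, matrix, polynomial):
    """Accumulate unreduced products; one remainder per result symbol."""
    results = []
    for j in range(len(matrix[0])):
        acc = 0
        for i, v in enumerate(vector):
            acc ^= poly_mul(v, matrix[i][j])      # wider, unreduced sum
        results.append(poly_rem(acc, polynomial))
    return results
```

For a 16-symbol vector and a 16 by 16 matrix, the first function performs 256 remainder operations and the second performs 16, yet both return the same 16 bytes for any polynomial of degree eight, such as 0x11B.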

In a still further aspect of the present invention, a technique is provided for incorporating floating point information into processor instructions. In U.S. Pat. No. 5,812,439, a system and method are described for incorporating control of rounding and exceptions for floating-point instructions into the instruction itself. The present invention extends this invention to include separate instructions in which rounding is specified, but default handling of exceptions is also specified, for a particular class of floating-point instructions. Specifically, the SINK instruction (which converts floating-point values to integral values) is available with control in the instruction that includes all previously specified combinations (default-near rounding and default exceptions; Z: round toward zero and trap on exceptions; N: round to nearest and trap on exceptions; F: floor rounding (toward minus infinity) and trap on exceptions; C: ceiling rounding (toward plus infinity) and trap on exceptions; and X: trap on inexact and other exceptions), as well as three new combinations (Z.D: round toward zero and default exception handling; F.D: floor rounding and default exception handling; and C.D: ceiling rounding and default exception handling). (Of the other combinations, N.D is equivalent to the default, and X.D, which traps on inexact but uses default handling for other exceptions, is possible but not particularly valuable.)

Instruction Scheduling

The next section describes detailed pipeline organization for Zeus, which has a significant influence on instruction scheduling. Here we will elaborate some general rules for effective scheduling by a compiler. Specific information on numbers of functional units, functional unit parallelism and latency is quite implementation-dependent; values indicated here are valid for Zeus's first implementation.

Separate Addressing from Execution

Zeus has separate function units to perform addressing operations (A, L, S, B instructions) from execution operations (G, X, E, W instructions). When possible, Zeus will execute all the addressing operations of an instruction stream, deferring execution of the execution operations until dependent load instructions are completed. Thus, the latency of the memory system is hidden, so long as addressing operations themselves do not need to wait for memory.

Software Pipeline

Instructions should generally be scheduled so that previous operations can be completed at the time of issue. When this is not possible, the processor inserts sufficient empty cycles to perform the instructions precisely; explicit no-operation instructions are not required.

Multiple Issue

Zeus can issue up to two addressing operations and up to two execution operations per cycle per thread. Considering functional unit parallelism, described below, as many as four instruction issues per cycle are possible per thread.

Functional Unit Parallelism

Zeus has separate function units for several classes of execution operations. An A unit performs scalar add, subtract, boolean, and shift-add operations for addressing and branch calculations. The remaining functional units are execution resources, which perform operations subsequent to memory loads and which operate on values in a parallel, partitioned form. A G unit performs add, subtract, boolean, and shift-add operations. An X unit performs general shift operations. An E unit performs multiply and floating-point operations. A T unit performs table-look-up operations.

Each instruction uses one or more of these units, according to the table below.

Instruction   A  G  X  E  T
A.            x
B             x
L             x
S             x
G                x
X                   x
E                   x  x
W.TRANSLATE   x           x
W.MULMAT      x     x  x
W.SWITCH      x     x

Latency

The latency of each functional unit depends on what operation is performed in the unit, and where the result is used. The aggressive nature of the pipeline makes it difficult to characterize the latency of each operation with a single number. Because the addressing unit is decoupled from the execution unit, the latency of load operations is generally hidden, unless the result of a load instruction must be returned to the addressing unit. Store instructions must be able to compute the address to which the data is to be stored in the addressing unit, but the data will not be irrevocably stored until the data is available and it is valid to retire the store instruction. However, under certain conditions, data may be forwarded from a store instruction to subsequent load instructions, once the data is available.

The latency of each of these units for the initial Zeus implementation is indicated below:

Unit  Instruction    Latency rules
A.    A              1 cycle
      L              Address operands must be ready to issue; 4 cycles to A unit, 0 to G, X, E, T units
      S              Address operands must be ready to issue; store occurs when data is ready and instruction may be retired
      B              Conditional branch operands may be provided from the A unit (54-bit values) or the G unit (128-bit values); 4 cycles for mispredicted branch
      W              Address operand must be ready to issue
G     G              1 cycle
X     X, W.SWITCH    1 cycle for data operands, 2 cycles for shift amount or control operand
E     E, W.MULMAT    4 cycles
T     W.TRANSLATE    1 cycle

Pipelining and Multithreading

As shown in FIG. 4, some embodiments of the present invention employ both decoupled access from execution pipelines and simultaneous multithreading in a unique way. Simultaneous Multithreaded pipelines have been employed in prior art to enhance the utilization of data path units by allowing instructions to be issued from one of several execution threads to each functional unit (e.g., Susan Eggers, University of Washington, papers on Simultaneous Multithreading).

Decoupled access from execution pipelines have been employed in prior art to enhance the utilization of execution data path units by buffering results from an access unit, which computes addresses to a memory unit that in turn fetches the requested items from memory, and then presenting them to an execution unit (e.g., James E. Smith, paper on Decoupled Access from Execution).

Compared to conventional pipelines, the Eggers prior art used an additional pipeline cycle before instructions could be issued to functional units; the additional cycle was needed to determine which threads should be permitted to issue instructions. Consequently, relative to conventional pipelines, the prior art design had additional delay, including dependent branch delay.

The embodiment shown in FIG. 4 contains individual access data path units, with associated register files, for each execution thread. These access units produce addresses, which are aggregated together to a common memory unit, which fetches all the addresses and places the memory contents in one or more buffers. Instructions for execution units, which are shared to varying degrees among the threads, are also buffered for later execution. The execution units then perform operations from all active threads using functional data path units that are shared.

For instructions performed by the execution units, the extra cycle required for prior art simultaneous multithreading designs is overlapped with the memory data access time from prior art decoupled access from execution cycles, so that no additional delay is incurred by the execution functional units for scheduling resources. For instructions performed by the access units, by employing individual access units for each thread the additional cycle for scheduling shared resources is also eliminated.

This is a favorable tradeoff because, while threads do not share the access functional units, these units are relatively small compared to the execution functional units, which are shared by threads.

FIG. 12 is a timing diagram of a decoupled pipeline structure in accordance with one embodiment of the present invention. As illustrated in FIG. 12, the time permitted by a pipeline to service load operations may be flexibly extended. Here, various types of instructions are abbreviated as A, L, B, E, and S, representing a register-to-register address calculation, a memory load, a branch, a register-to-register data calculation, and a memory store, respectively. According to the present embodiment, the front of the pipeline, in which A, L and B type instructions are handled, is decoupled from the back of the pipeline, in which E and S type instructions are handled. This decoupling occurs at the point at which the data cache and its backing memory are referenced; similarly, a FIFO that is filled by the instruction fetch unit decouples instruction cache references from the front of the pipeline shown above. The depth of the FIFO structures is implementation-dependent, i.e. not fixed by the architecture. FIG. 13 further illustrates this pipeline organization. Accordingly, the latency of load instructions can be hidden, as execute instructions are deferred until the results of the load are available. Nevertheless, the execution unit still processes instructions in normal order, and provides precise exceptions. More details relating to this pipeline structure are explained in the “Superspring Pipeline” section.

A difficulty in particular pipeline structures is that dependent operations must be separated by the latency of the pipeline, and for highly pipelined machines, the latency of simple operations can be quite significant. According to one embodiment of the present invention, very highly pipelined implementations are provided by alternating execution of two or more independent threads. In an embodiment, a thread is the state required to maintain an independent execution; the architectural state required is that of the register file contents, program counter, privilege level, local TB, and when required, exception status. In an embodiment, ensuring that only one thread may handle an exception at one time may minimize the latter state, exception status. In order to ensure that all threads make reasonable forward progress, several of the machine resources must be scheduled fairly.

An example of a resource that must be shared fairly is the data memory/cache subsystem. In one embodiment, the processor may be able to perform a load operation only on every second cycle, and a store operation only on every fourth cycle. The processor schedules these fixed timing resources fairly by using a round-robin schedule for a number of threads that is relatively prime to the resource reuse rates. In one embodiment, five simultaneous threads of execution ensure that resources which may be used every two or four cycles are fairly shared by allowing the instructions which use those resources to be issued only on every second or fourth issue slot for that thread. More details relating to this pipeline structure are explained in the “Superthread Pipeline” section.

Referring back to FIG. 4, with regard to the sharing of execution units, one embodiment of the present invention employs several different classes of functional units for the execution unit, with varying cost, utilization, and performance. In particular, the G units, which perform simple addition and bitwise operations, are relatively inexpensive (in area and power) compared to the other units, and their utilization is relatively high. Consequently, the design employs four such units, where each unit can be shared between two threads. The X unit, which performs a broad class of data switching functions, is more expensive and less used, so two units are provided that are each shared among two threads. The T unit, which performs the Wide Translate instruction, is expensive and utilization is low, so the single unit is shared among all four threads. The E unit, which performs the class of Ensemble instructions, is very expensive in area and power compared to the other functional units, but utilization is relatively high, so we provide two such units, each unit shared by two threads.

In FIG. 4, four copies of an access unit are shown, each with an access instruction fetch queue A-Queue 401-404, coupled to an access register file AR 405-408, each of which is, in turn, coupled to two access functional units A 409-416. The access units function independently for four simultaneous threads of execution. These eight access functional units A 409-416 produce results for access register files AR 405-408 and addresses to a shared memory system 417. The memory contents fetched from memory system 417 are combined with execute instructions not performed by the access unit and entered into the four execute instruction queues E-Queue 421-424. Instructions and memory data from E-Queue 421-424 are presented to execution register files 425-428, which fetch execution register file source operands. The instructions are coupled to the execution unit arbitration unit Arbitration 431, which selects which instructions from the four threads are to be routed to the available execution units E 441 and 449, X 442 and 448, G 443-444 and 446-447, and T 445. The execution register file source operands ER 425-428 are coupled to the execution units 441-445 using source operand buses 451-454 and to the execution units 445-449 using source operand buses 455-458. The function unit result operands from execution units 441-445 are coupled to the execution register file using result bus 461 and the function unit result operands from execution units 445-449 are coupled to the execution register file using result bus 462.

In a still further aspect of the present invention, an improved interprivilege gateway is described which involves increased parallelism and leads to enhanced performance. In U.S. application Ser. No. 08/541,416, now U.S. Pat. No. 6,101,590, a system and method is described for implementing an instruction that, in a controlled fashion, allows the transfer of control (branch) from a lower-privilege level to a higher-privilege level. Embodiments of the present invention provide an improved system and method for a modified instruction that accomplishes the same purpose but with specific advantages.

Many processor resources, such as control of the virtual memory system itself, input and output operations, and system control functions, are protected from accidental or malicious misuse by enclosing them in a protective, privileged region. Entry to this region must be established only through particular entry points, called gateways, to maintain the integrity of these protected regions.

Prior art versions of this operation generally load an address from a region of memory using a protected virtual memory attribute that is only set for data regions that contain valid gateway entry points, then perform a branch to an address contained in the contents of memory. Basically, three steps were involved: load, branch, then check. Compared to other instructions, such as register-to-register computation instructions and memory loads and stores, and register-based branches, this is a substantially longer operation, which introduces delays and complexity to a pipelined implementation.

In the present invention, the branch-gateway instruction performs two operations in parallel: 1) a branch is performed to the contents of register 0 and 2) a load is performed using the contents of register 1, using a specified byte order (little-endian) and a specified size (64 bits). If the value loaded from memory does not equal the contents of register 0, the instruction is aborted due to an exception. In addition, 3) a return address (the next sequential instruction address following the branch-gateway instruction) is written into register 0, provided the instruction is not aborted. This approach essentially uses a first instruction to establish the requisite permission to allow user code to access privileged code, and then a second instruction is permitted to branch directly to the privileged code because of the permissions issued for the first instruction.

In the present invention, the new privilege level is also contained in register 0, and the second parallel operation does not need to be performed if the new privilege level is not greater than the old privilege level. When this second operation is suppressed, the remainder of the instruction performs an identical function to a branch-link instruction, which is used for invoking procedures that do not require an increase in privilege. The advantage that this feature brings is that the branch-gateway instruction can be used to call a procedure that may or may not require an increase in privilege.

The memory load operation verifies with the virtual memory system that the region that is loaded has been tagged as containing valid gateway data. A further advantage of the present invention is that the called procedure may rely on the fact that register 1 contains the address that the gateway data was loaded from, and can use the contents of register 1 to locate additional data or addresses that the procedure may require. Prior art versions of this instruction required that an additional address be loaded from the gateway region of memory in order to initialize that address in a protected manner; the present invention allows the address itself to be loaded with a “normal” load operation that does not require special protection.
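As a non-limiting illustration, the branch-gateway behavior described above may be sketched in software as follows. This is a minimal model under stated assumptions, not the instruction definition: the function name branch_gateway, the representation of memory as a byte array, the list of gateway-tagged address ranges, the 4-byte instruction size, and the placement of the privilege level in the two low-order bits of register 0 are all assumptions made for the example. The sketch also leaves the privilege level unchanged when no increase is requested, which is a simplification.

```python
import struct


class GatewayException(Exception):
    """Raised when the gateway check fails and the instruction is aborted."""


def branch_gateway(regs, pc, privilege, memory, gateway_regions, insn_bytes=4):
    """Model one branch-gateway instruction; returns (new_pc, new_privilege).

    regs            -- mutable list of register values (regs[0], regs[1] are used)
    memory          -- bytes-like object standing in for data memory (assumption)
    gateway_regions -- list of (low, high) ranges tagged as gateway data (assumption)
    """
    target = regs[0]
    # Assumption: the new privilege level sits in the two low-order bits of register 0.
    new_privilege = target & 0x3

    if new_privilege > privilege:
        # Operation 2: 64-bit little-endian load from the address in register 1.
        addr = regs[1]
        if not any(low <= addr and addr + 8 <= high for low, high in gateway_regions):
            raise GatewayException("address is not tagged as gateway data")
        (loaded,) = struct.unpack_from("<Q", memory, addr)
        if loaded != target:
            raise GatewayException("loaded value does not equal register 0")
        privilege = new_privilege
    # Otherwise the load is suppressed and the instruction acts like branch-link.

    # Operation 3: write the return address into register 0 (instruction not aborted).
    regs[0] = pc + insn_bytes
    # Operation 1: branch to the address held in register 0 (privilege bits cleared).
    return target & ~0x3, privilege
```

In this sketch, register 1 is left untouched, so the called procedure can still use it to locate data near the gateway, as described above.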

The present invention allows a “normal” load operation to also load the contents of register 0 prior to issuing the branch-gateway instruction. The value may be loaded from the same memory address that is loaded by the branch-gateway instruction, because the present invention contains a virtual memory system in which the region may be enabled for normal load operations as well as the special “gateway” load operation performed by the branch-gateway instruction.

In a further aspect of the present invention, a system and method is provided for performing a three-input bitwise Boolean operation in a single instruction. A novel method described in detail in another section is used to encode the eight possible output states of such an operation into only seven bits, and to decode these seven bits back into the eight states.
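By way of illustration only, a three-input bitwise Boolean operation can be modeled with a plain eight-entry truth table, one entry per combination of the three input bits. The sketch below uses an ordinary 8-bit table and does not reproduce the seven-bit encoding, which is described in another section; the function name boolean3, the bit ordering of the table index, and the default 128-bit width are assumptions made for the example.

```python
def boolean3(a, b, c, table, width=128):
    """Apply a three-input Boolean function bitwise across `width`-bit operands.

    `table` is an 8-bit truth table: bit i gives the output for the input
    combination i = (a_bit << 2) | (b_bit << 1) | c_bit (assumed ordering).
    """
    mask = (1 << width) - 1
    result = 0
    for index in range(8):
        if (table >> index) & 1:
            # Select the bit positions where (a, b, c) equals this combination.
            term_a = a if index & 4 else ~a
            term_b = b if index & 2 else ~b
            term_c = c if index & 1 else ~c
            result |= term_a & term_b & term_c
    return result & mask
```

With this index ordering, a table value of 0x80 selects the three-input AND, 0x96 the three-input exclusive-or, and 0xCA a bitwise multiplex that takes b where a is set and c elsewhere.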

In yet a further aspect of the present invention, a system and method is described for improving the branch prediction of simple repetitive loops of code. The method includes providing a count field for indicating how many times a branch is likely to be taken before it is not taken, which enhances the ability to properly predict both the initial and final branches of simple loops when a compiler can determine the number of iterations that the loop will be performed. This improves performance by avoiding misprediction of the branch at the end of a loop.
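The benefit of a count field can be illustrated with a small comparison, offered as a sketch under assumptions rather than a description of the hardware. Here count_predictions models a predictor that is told the branch will be taken a given number of times before falling through, and last_outcome_predictions models a simple predictor that repeats the previous outcome; both function names and the reload-after-fall-through behavior are assumptions for the example.

```python
def count_predictions(count, executions):
    """Predict taken `count` times, then not taken once, then repeat (assumed policy)."""
    predictions = []
    remaining = count
    for _ in range(executions):
        if remaining > 0:
            predictions.append(True)
            remaining -= 1
        else:
            predictions.append(False)
            remaining = count  # the loop is entered again with the same count
    return predictions


def last_outcome_predictions(outcomes, initial=True):
    """Baseline: predict whatever the branch did the previous time."""
    predictions = []
    last = initial
    for outcome in outcomes:
        predictions.append(last)
        last = outcome
    return predictions


def mispredictions(predictions, outcomes):
    """Count the positions where the prediction differs from the actual outcome."""
    return sum(p != o for p, o in zip(predictions, outcomes))


# A loop whose closing branch is taken three times and then falls through, run twice.
outcomes = ([True] * 3 + [False]) * 2
```

For this outcome sequence the count-based predictor matches every branch, while the last-outcome baseline mispredicts at each loop exit and again on re-entry.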

Pipeline Organization

Zeus performs all instructions as if executed one-by-one, in-order, with precise exceptions always available. Consequently, code that ignores the subsequent discussion of Zeus pipeline implementations will still perform correctly. However, the highest performance of the Zeus processor is achieved only by matching the ordering of instructions to the characteristics of the pipeline. In the following discussion, the general characteristics of all Zeus implementations precede discussion of specific choices for specific implementations.

Classical Pipeline Structures

Pipelining in general refers to hardware structures that overlap various stages of execution of a series of instructions so that the time required to perform the series of instructions is less than the sum of the times required to perform each of the instructions separately. Additionally, pipelines carry the connotation of a collection of hardware structures which have a simple ordering and where each structure performs a specialized function.

FIG. 107 shows the timing of what has become a canonical pipeline structure for a simple RISC processor, with time on the horizontal axis increasing to the right, and successive instructions on the vertical axis going downward. The stages I, R, E, M, and W refer to units which perform instruction fetch, register file fetch, execution, data memory fetch, and register file write. The stages are aligned so that the result of the execution of an instruction may be used as the source of the execution of an immediately following instruction, as seen by the fact that the end of an E stage (bold in line 1) lines up with the beginning of the E stage (bold in line 2) immediately below. Also, it can be seen that the result of a load operation executing in stages E and M (bold in line 3) is not available in the immediately following instruction (line 4), but may be used two cycles later (line 5); this is the cause of the load delay slot seen on some RISC processors.
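The alignment described for the canonical pipeline can be checked with a short model, given here only as a sketch. It assumes one instruction issues per cycle, that each stage takes exactly one cycle, and that a result becomes usable in the cycle after the producing stage ends; the names STAGES, stage_cycle, diagram and can_use are introduced for the example.

```python
STAGES = "IREMW"  # instruction fetch, register fetch, execute, memory, write


def stage_cycle(instr, stage):
    """Cycle in which instruction number `instr` occupies `stage` (one issue per cycle)."""
    return instr + STAGES.index(stage)


def diagram(count):
    """Text rows of the timing diagram: time runs right, instructions run down."""
    return [" " * i + STAGES for i in range(count)]


def can_use(producer, consumer, producer_is_load):
    """True if `consumer` can read the producer's result at the start of its E stage."""
    producing_stage = "M" if producer_is_load else "E"
    ready = stage_cycle(producer, producing_stage) + 1  # first cycle the value is usable
    return stage_cycle(consumer, "E") >= ready
```

Under these assumptions an execute result feeds the very next instruction, while a load result is unavailable to the next instruction and usable by the one after it, which corresponds to the load delay slot noted above.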

In FIG. 108, we simplify the diagrams somewhat by eliminating the pipe stages for instruction fetch, register file fetch, and register file write, which can be understood to precede and follow the portions of the pipelines diagrammed. The diagram above is shown again in this new format, showing that the canonical pipeline has very little overlap of the actual execution of instructions.

A superscalar pipeline is one capable of simultaneously issuing two or more instructions which are independent, in that they can be executed in either order and separately, producing the same result as if they were executed serially. FIG. 109 shows a two-way superscalar processor, where one instruction may be a register-to-register operation (using stage E) and the other may be a register-to-register operation (using stage A) or a memory load or store (using stages A and M).

A superpipelined pipeline is one capable of issuing simple instructions frequently enough that the result of a simple instruction must be independent of the immediately following one or more instructions. FIG. 110 shows a two-cycle superpipelined implementation.

In the diagrams below, pipeline stages are labelled with the type of instruction that may be performed by that stage. The position of the stage further identifies the function of that stage, as for example a load operation may require several L stages to complete the instruction.

Superstring Pipeline

Zeus architecture provides for implementations designed to fetch and execute several instructions in each clock cycle. For a particular ordering of instruction types, one instruction of each type may be issued in a single clock cycle. The ordering required is A, L, E, S, B; in other words, a register-to-register address calculation, a memory load, a register-to-register data calculation, a memory store, and a branch. Because of the organization of the pipeline, each of these instructions may be serially dependent. Instructions of type E include the fixed-point execute-phase instructions as well as floating-point and digital signal processing instructions. We call this form of pipeline organization “superstring,” (readers with a background in theoretical physics may have seen this term in another, unrelated, context) because of the ability to issue a string of dependent instructions in a single clock cycle, as distinguished from superscalar or superpipelined organizations, which can only issue sets of independent instructions.
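The effect of the required A, L, E, S, B ordering on issue grouping can be sketched as follows. This is an illustrative model only: it assumes a greedy grouping in program order, at most one instruction of each type per clock cycle, and it ignores latencies and operand readiness. The names ISSUE_ORDER and group_into_parcels are introduced for the example.

```python
ISSUE_ORDER = "ALESB"  # address calc, load, execute, store, branch


def group_into_parcels(instruction_types):
    """Greedily split a program-order sequence of instruction types into parcels.

    Each parcel holds at most one instruction of each type, and the types within
    a parcel appear in the order A, L, E, S, B (assumed single-cycle issue group).
    """
    parcels = []
    current = []
    last_rank = -1
    for kind in instruction_types:
        rank = ISSUE_ORDER.index(kind)
        if rank <= last_rank:
            # This type cannot follow the previous one in the same cycle.
            parcels.append(current)
            current = []
        current.append(kind)
        last_rank = rank
    if current:
        parcels.append(current)
    return parcels
```

For example, the sequence A, L, E, S, B fits in one parcel, whereas B followed by A needs two, which suggests why matching the instruction order to the pipeline matters for performance.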

These instructions take from one to four cycles of latency to execute, and a branch prediction mechanism is used to keep the pipeline filled. FIG. 111 shows a box for the interval between issue of each instruction and the completion. Bold letters mark the critical latency paths of the instructions, that is, the periods between the required availability of the source registers and the earliest availability of the result registers. The A-L critical latency path is a special case, in which the result of the A instruction may be used as the base register of the L instruction without penalty. E instructions may require additional cycles of latency for certain operations, such as fixed-point multiply and divide, floating-point and digital signal processing operations.

Superspring Pipeline

Zeus architecture provides an additional refinement to the organization defined above, in which the time permitted by the pipeline to service load operations may be flexibly extended. Thus, the front of the pipeline, in which A, L and B type instructions are handled, is decoupled from the back of the pipeline, in which E and S type instructions are handled, as shown in FIG. 112. This decoupling occurs at the point at which the data cache and its backing memory are referenced; similarly, a FIFO that is filled by the instruction fetch unit decouples instruction cache references from the front of the pipeline shown above. The depth of the FIFO structures is implementation-dependent, i.e. not fixed by the architecture.

FIG. 13 indicates why we call this pipeline organization feature “superspring,” an extension of our superstring organization.

With the superspring organization, the latency of load instructions can be hidden, as execute instructions are deferred until the results of the load are available. Nevertheless, the execution unit still processes instructions in normal order, and provides precise exceptions.

Superthread Pipeline

This technique is not employed in the initial Zeus implementation, though it was present in an earlier prototype implementation.

A difficulty of superpipelining is that dependent operations must be separated by the latency of the pipeline, and for highly pipelined machines, the latency of simple operations can be quite significant. The Zeus “superthread” pipeline provides for very highly pipelined implementations by alternating execution of two or more independent threads. In this context, a thread is the state required to maintain an independent execution; the architectural state required is that of the register file contents, program counter, privilege level, local TB, and when required, exception status. Ensuring that only one thread may handle an exception at one time may minimize the latter state, exception status. In order to ensure that all threads make reasonable forward progress, several of the machine resources must be scheduled fairly.

An example of a resource that must be shared fairly is the data memory/cache subsystem. In a prototype implementation, Zeus is able to perform a load operation only on every second cycle, and a store operation only on every fourth cycle. Zeus schedules these fixed timing resources fairly by using a round-robin schedule for a number of threads that is relatively prime to the resource reuse rates. For this implementation, five simultaneous threads of execution ensure that resources which may be used every two or four cycles are fairly shared by allowing the instructions which use those resources to be issued only on every second or fourth issue slot for that thread.

In FIG. 113, the thread number which issues an instruction is indicated on each clock cycle, and below it, a list of which functional units may be used by that instruction. The diagram repeats every 20 cycles, so cycle 20 is similar to cycle 0, cycle 21 is similar to cycle 1, etc. This schedule ensures that no resource conflicts occur between threads for these resources. Thread 0 may issue an E, L, S or B on cycle 0, but on its next opportunity, cycle 5, may only issue E or B, and on cycle 10 may issue E, L or B, and on cycle 15, may issue E or B.

As shown in FIG. 114, when seen from the perspective of an individual thread, the resource use diagram looks similar to that of the collection. Thus an individual thread may use the load unit every two instructions, and the store unit every four instructions.
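The round-robin schedule described above can be reproduced with a few lines of code, given here as an illustrative sketch. It assumes that the load unit is available on even-numbered cycles and the store unit on cycles that are multiples of four, which is consistent with the thread 0 example in the text, and that E and B may issue in every slot. The names issue_slot and slots_for_thread are introduced for the example.

```python
THREADS = 5           # number of threads, relatively prime to the reuse rates
LOAD_PERIOD = 2       # a load may start only every second cycle
STORE_PERIOD = 4      # a store may start only every fourth cycle
SCHEDULE_LENGTH = 20  # the diagram repeats every 20 cycles


def issue_slot(cycle):
    """Return (thread, units) for a cycle: which thread issues and what it may use."""
    thread = cycle % THREADS
    units = ["E"]
    if cycle % LOAD_PERIOD == 0:   # assumption: loads are aligned to even cycles
        units.append("L")
    if cycle % STORE_PERIOD == 0:  # assumption: stores are aligned to multiples of 4
        units.append("S")
    units.append("B")
    return thread, units


def slots_for_thread(thread, cycles=SCHEDULE_LENGTH):
    """List (cycle, units) for every issue slot owned by `thread`."""
    return [(c, issue_slot(c)[1]) for c in range(cycles) if issue_slot(c)[0] == thread]
```

Because five is relatively prime to two and four, each thread's own slots rotate through every phase of the load and store timing, so within one 20-cycle period every thread receives two load slots and one store slot out of its four issue slots.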

A Zeus Superthread pipeline, with 5 simultaneous threads of execution, permits simple operations, such as register-to-register add (G.ADD), to take 5 cycles to complete, allowing for an extremely deeply pipelined implementation.

Simultaneous Multithreading

The initial Zeus implementation performs simultaneous multithreading among 4 threads. The 4 threads share a common memory system and a common T unit. Pairs of threads share two G units, one X unit, and one E unit. Each thread individually has two A units. A fair allocation scheme balances access to the shared resources by the four threads.

Branch/Fetch Prediction

Zeus does not have delayed branch instructions, and so relies upon branch or fetch prediction to keep the pipeline full around unconditional and conditional branch instructions. In the simplest form of branch prediction, as in Zeus's first implementation, a taken conditional backward (toward a lower address) branch predicts that a future execution of the same branch will be taken. More elaborate prediction may cache the source and target addresses of multiple branches, both conditional and unconditional, and both forward and reverse.

The hardware prediction mechanism is tuned for optimizing conditional branches that close loops or express frequent alternatives, and will generally require substantially more cycles when executing conditional branches whose outcome is not predominately taken or not-taken.

For such cases of unpredictable conditional results, the use of code that avoids conditional branches in favor of the use of compare-set and multiplex instructions may result in greater performance.

Under some conditions, the above technique may not be applicable, for example if the conditional branch “guards” code which cannot be performed when the branch is taken. This may occur, for example, when a conditional branch tests for a valid (non-zero) pointer and the conditional code performs a load or store using the pointer. In these cases, the conditional branch has a small positive offset, but is unpredictable. A Zeus pipeline may handle this case as if the branch is always predicted to be not taken, with the recovery of a misprediction causing cancellation of the instructions which have already been issued but not completed which would be skipped over by the taken conditional branch. This “conditional-skip” optimization is performed by the initial Zeus implementation and requires no specific architectural feature to access or implement.

A Zeus pipeline may also perform “branch-return” optimization, in which a branch-link instruction saves a branch target address that is used to predict the target of the next returning branch instruction. This optimization may be implemented with a depth of one (only one return address kept), or as a stack of finite depth, where a branch and link pushes onto the stack, and a branch-register pops from the stack. This optimization can eliminate the misprediction cost of simple procedure calls, as the calling branch is susceptible to hardware prediction, and the returning branch is predictable by the branch-return optimization. Like the conditional-skip optimization described above, this feature is performed by the initial Zeus implementation and requires no specific architectural feature to access or implement.
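The branch-return optimization can be pictured as a small return-address stack, sketched below for illustration. The class name ReturnStack and its methods are introduced for the example, and the policy of discarding the oldest entry when the stack is full is an assumption; the text above specifies only that the depth is one or some finite value.

```python
class ReturnStack:
    """Finite-depth stack of predicted return addresses (depth 1 keeps a single entry)."""

    def __init__(self, depth=1):
        self.depth = depth
        self.entries = []

    def branch_link(self, return_address):
        """A branch-link pushes the address the matching return should go to."""
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # assumption: the oldest prediction is discarded when full
        self.entries.append(return_address)

    def predict_return(self):
        """A branch-register pops its predicted target, or None if nothing is stored."""
        return self.entries.pop() if self.entries else None
```

With a depth of one, only the most recent call is predicted on return; a deeper stack also covers nested calls up to its depth.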

Zeus implements two related instructions that can eliminate or reduce branch delays for conditional loops, conditional branches, and computed branches. The “branch-hint” instruction has no effect on architectural state, but informs the instruction fetch unit of a potential future branch instruction, giving the addresses of both the branch instruction and of the branch target. The two forms of the instruction specify the branch instruction address relative to the current address as an immediate field, and one form (branch-hint-immediate) specifies the branch target address relative to the current address as an immediate field, and the other (branch-hint) specifies the branch target address from a general register. The branch-hint-immediate instruction is generally used to give advance notice to the instruction fetch unit of a branch-conditional instruction, so that instructions at the target of the branch can be fetched in advance of the branch-conditional instruction reaching the execution pipeline. Placing the branch hint as early as possible, and at a point where the extra instruction will not reduce the execution rate, optimizes performance. In other words, an optimizing compiler should insert the branch-hint instruction as early as possible in the basic block where the parcel will contain at most one other “front-end” instruction.

Result Forwarding

When temporally adjacent instructions are executed by separate resources, the results of the first instruction must generally be forwarded directly to the resource used to execute the second instruction, where the result replaces a value which may have been fetched from a register file. Such forwarding paths use significant resources. A Zeus implementation must generally provide forwarding resources so that dependencies from earlier instructions within a string are immediately forwarded to later instructions, except between a first and second execution instruction as described above. In addition, when forwarding results from the execution units back to the data fetch unit, additional delay may be incurred.

Memory Management

This section discusses the caches, the translation mechanisms, the memory interfaces, and how the multiprocessor interface is used to maintain cache coherence.

Overview

FIG. 14 is a diagram illustrating the basic organization of the memory management system according to one embodiment of the invention. In accordance with this embodiment, the Zeus processor provides for both local and global virtual addressing, arbitrary page sizes, and coherent-cache multiprocessing. The memory management system is designed to provide the requirements for implementation of virtual machines as well as virtual memory. All facilities of the memory management system are themselves memory mapped, in order to provide for the manipulation of these facilities by high-level language, compiled code. The translation mechanism is designed to allow full byte-at-a-time control of access to the virtual address space, with the assistance of fast exception handlers. Privilege levels provide for the secure transition between insecure user code and secure system facilities. Instructions execute at a privilege, specified by a two-bit field in the access information. Zero is the least-privileged level, and three is the most-privileged level.

In general terms, the memory management starts from a local virtual address. The local virtual address is translated to a global virtual address by an LTB (Local Translation Buffer). In turn, the global virtual address is translated to a physical address by a GTB (Global Translation Buffer). One of the addresses, a local virtual address, a global virtual address, or a physical address, is used to index the cache data and cache tag arrays, and one of the addresses is used to check the cache tag array for cache presence. Protection information is assembled from the LTB, GTB, and optionally the cache tag, to determine if the access is legal.

This form varies somewhat, depending on implementation choices made. Because the LTB leaves the lower 48 bits of the address alone, indexing of the cache arrays with the local virtual address is usually identical to cache arrays indexed by the global virtual address. However, indexing cache arrays by the global virtual address rather than the physical address produces a coherence issue if the mapping from global virtual address to physical is many-to-one.

Starting from a local virtual address, the memory management system performs three actions in parallel: the low-order bits of the virtual address are used to directly access the data in the cache, a low-order bit field is used to access the cache tag, and the high-order bits of the virtual address are translated from a local address space to a global virtual address space.

Following these three actions, operations vary depending upon the cache implementation. The cache tag may contain either a physical address and access control information (a physically-tagged cache), or may contain a global virtual address and global protection information (a virtually-tagged cache).

For a physically-tagged cache, the global virtual address is translated to a physical address by the GTB, which generates global protection information. The cache tag is checked against the physical address, to determine a cache hit. In parallel, the local and global protection information is checked.

For a virtually-tagged cache, the cache tag is checked against the global virtual address, to determine a cache hit, and the local and global protection information is checked. If the cache misses, the global virtual address is translated to a physical address by the GTB, which also generates the global protection information.

Local Translation Buffer

The 64-bit global virtual address space is global among all tasks. In a multitask environment, requirements for a task-local address space arise from operations such as the UNIX “fork” function, in which a task is duplicated into parent and child tasks, each now having a unique virtual address space. In addition, when switching tasks, access to one task's address space must be disabled and another task's access enabled.

Zeus provides for portions of the address space to be made local to individual tasks, with a translation to the global virtual space specified by four 16-bit registers for each local virtual space. The registers specify a mask selecting which of the high-order 16 address bits are checked to match a particular value, and if they match, a value with which to modify the virtual address. Zeus avoids setting a fixed page size or local address size; these can be set by software conventions.

A local virtual address space is specified by the following:

Local Virtual Address Space Specifiers

field name  size  description
lm          16    mask to select fields of local virtual address to perform match over
la          16    value to perform match with masked local virtual address
lx          16    value to xor with local virtual address if matched
lp          16    local protection field (detailed later)

Physical Address

There are as many LTBs as threads, and up to 2³ (8) entries per LTB. Each entry is 128 bits, with the high order 64 bits reserved. FIG. 15 illustrates the physical address of an LTB entry for thread th, entry en, byte b.

Definition

FIG. 16 illustrates a definition for AccessPhysicalLTB.

Entry Format

FIG. 17 illustrates how various 16-bit values are packed together into a 64-bit LTB entry. The LTB contains a separate context of register sets for each thread, indicated by the th index above. A context consists of one or more sets of lm/la/lx/lp registers, one set for each simultaneously accessible local virtual address range, indicated by the en index above. This set of registers is called the “Local TB context,” or LTB (Local Translation Buffer) context. The effect of this mechanism is to provide the facilities normally attributed to segmentation. However, in this system there is no extension of the address range; instead, segments are local nicknames for portions of the global virtual address space.

A failure to match a L TB entry results either in an exception or anaccess to the global virtual address space, depending on privilegelevel. A single bit, selected by the privilege level active for theaccess from a four bit control register field, global access, gadetermines the result. If ga_(pL) is zero (0), the failure causes anexception, if it is one (1), the failure causes the address to bedirectly used as a global virtual address without modification.
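The match, xor, and miss behavior described above can be sketched as follows. This is a minimal illustration, not the definition in FIG. 22: the names `LTBEntry`, `LocalTBMiss`, and `local_translation` are hypothetical, and the sketch assumes that bit p of the four-bit ga field belongs to privilege level p.

```python
from dataclasses import dataclass


class LocalTBMiss(Exception):
    """Raised when no LTB entry matches and global access is disabled."""


@dataclass
class LTBEntry:
    lm: int      # 16-bit mask selecting which high-order bits are compared
    la: int      # 16-bit value compared against the masked high-order bits
    lx: int      # 16-bit value xor'ed into the high-order bits on a match
    lp: int = 0  # 16-bit local protection field (not interpreted here)


def local_translation(va, entries, ga_bits, privilege):
    """Translate a 64-bit local virtual address into a global virtual address.

    Returns (global_address, lp); lp is None when the address is used
    directly as a global address after an LTB miss.
    """
    high = (va >> 48) & 0xFFFF
    for e in entries:
        # Only the bit positions selected by lm take part in the comparison.
        if (high ^ e.la) & e.lm == 0:
            return va ^ (e.lx << 48), e.lp
    # Miss: the ga bit for the active privilege level decides the outcome
    # (assumed layout: bit p of ga_bits corresponds to privilege level p).
    if (ga_bits >> privilege) & 1:
        return va, None
    raise LocalTBMiss(hex(va))
```

With ga2 and ga3 set (`ga_bits = 0b1100`), a miss at privilege 3 passes the address through unchanged, while the same miss at privilege 1 raises the exception that a handler could use to load a new LTB entry.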

FIG. 18 illustrates global access as fields of a control register. Usually, global access is a right conferred to high privilege levels, so a typical system may be configured with ga0 and ga1 clear (0), but ga2 and ga3 set (1). A single low-privilege (0) task can be safely permitted to have global access, as accesses are further limited by the rwxg privilege fields. A concrete example of this is an emulation task, which may use global addresses to simulate segmentation, such as an x86 emulation. The emulation task then runs as privilege 0, with ga0 set, while most user tasks run as privilege 1, with ga1 clear. Operating system tasks then use privilege 2 and 3 to communicate with and control the user tasks, with ga2 and ga3 set.

For tasks that have global access disabled at their current privilege level, failure to match a LTB entry causes an exception. The exception handler may load a LTB entry and continue execution, thus providing access to an arbitrary number of local virtual address ranges.

When failure to match a LTB entry does not cause an exception, instructions may access any region in the local virtual address space when a LTB entry matches, and may access regions in the global virtual address space when no LTB entry matches. This mechanism permits privileged code to make judicious use of local virtual address ranges, which simplifies the manner in which privileged code may manipulate the contents of a local virtual address range on behalf of a less-privileged client. Note, however, that under this model, an LTB miss does not cause an exception directly, so the use of more local virtual address ranges than LTB entries requires more care: the local virtual address ranges should be selected so as not to overlap with the global virtual address ranges, and GTB misses to LVA regions must be detected and cause the handler to load an LTB entry.

Each thread has an independent LTB, so that threads may independently define local translation. The size of the LTB for each thread is implementation dependent and defined as the LE parameter in the architecture description register. LE is the log of the number of entries in the local TB per thread; an implementation may define LE to be a minimum of 0, meaning one LTB entry per thread, or a maximum of 3, meaning eight LTB entries per thread. For the initial Zeus implementation, each thread has two entries and LE=1.

A minimum implementation of a LTB context is a single set of lm/la/lx/lp registers per thread. However, the need for the LTB to translate both code addresses and data addresses imposes some limits on the use of the LTB in such systems. We need to be able to guarantee forward progress. With a single LTB set per thread, either the code or the data must use global addresses, or both must use the same local address range, as must the LTB and GTB exception handler. To avoid this restriction, the implementation must be raised to two sets per thread, at least one for code and one for data, to guarantee forward progress for arbitrary use of local addresses in the user code (but still be limited to using global addresses for exception handlers).

As shown in FIG. 19, a single-set LTB context may be further simplified by reserving the implementation of the lm and la registers, setting them to a read-only zero value. Note that in such a configuration, only a single LA region can be implemented.

If the largest possible space is reserved for an address space identifier, the virtual address is partitioned as shown in FIG. 20. Any of the bits marked as “local” below may be used as “offset” as desired.

To improve performance, an implementation may perform the LTB translation on the value of the base register (rc) or unincremented program counter, provided that a check is performed which prohibits changing the unmasked upper 16 bits by the add or increment. If this optimization is provided and the check fails, an AccessDisallowedByVirtualAddress should be signaled. If this optimization is provided, the architecture description parameter LB=1. Otherwise, LTB translation is performed on the local address, la, no checking is required, and LB=0.

As shown in FIG. 21, the LTB protect field controls the minimum privilege level required for each memory action of read (r), write (w), execute (x), and gateway (g), as well as memory and cache attributes of write allocate (wa), detail access (da), strong ordering (so), cache disable (cd), and write through (wt). These fields are combined with corresponding bits in the GTB protect field to control these attributes for the mapped memory region.

Field Description

The meanings of the fields are given by the following table:

name   size   meaning
g      2      minimum privilege required for gateway access
x      2      minimum privilege required for execute access
w      2      minimum privilege required for write access
r      2      minimum privilege required for read access
0      1      reserved
da     1      detail access
so     1      strong ordering
cc     3      cache control

Definition

FIG. 22 illustrates a definition for LocalTranslation.

Global Translation Buffer

Global virtual addresses which fail to be accessed in either the LZC, the MTB, the BTB, or PTB are translated to physical references in a table, here named the “Global Translation Buffer,” (GTB).

Each processor may have one or more GTBs, with each GTB shared by one or more threads. The parameter GT, the base-two log of the number of threads which share a GTB, and the parameter T, the number of threads, allow computation of the number of GTBs (T/2^(GT)), and the number of threads which share each GTB (2^(GT)).

If there are two GTBs and four threads (GT=1, T=4), GTB 0 services references from threads 0 and 1, and GTB 1 services references from threads 2 and 3. In the first implementation, there is one GTB, shared by all four threads (GT=2, T=4). The GTB has 128 entries (G=7).
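The relationships among T, GT, and G can be written out as a short sketch. The function names are hypothetical; selecting the GTB by dropping the low-order GT bits of the thread number follows the grouping given in the example above.

```python
def gtb_count(T, GT):
    """Number of GTBs: T / 2**GT."""
    return T >> GT


def threads_per_gtb(GT):
    """Number of threads sharing each GTB: 2**GT."""
    return 1 << GT


def gtb_for_thread(th, GT):
    """Index of the GTB serving thread th (low-order GT bits are ignored)."""
    return th >> GT


def gtb_entries(G):
    """Number of entries in a GTB: 2**G."""
    return 1 << G
```

For GT=1 and T=4 this gives two GTBs, with threads 2 and 3 mapped to GTB 1; for GT=2 all four threads share GTB 0, and G=7 gives 128 entries.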

Per clock cycle, each GTB can translate one global virtual address to a physical address, yielding protection information as a side effect.

A GTB miss causes a software trap. This trap is designed to permit a fast handler for GlobalTBMiss to be written in software, by permitting a second GTB miss to occur as an exception, rather than a machine check.

Physical Address

There may be as many GTBs as threads, and up to 2¹⁵ entries per GTB. FIG. 23 illustrates the physical address of a GTB entry for thread th, entry en, byte b. Note that in FIG. 23, the low-order GT bits of the th value are ignored, reflecting that 2^(GT) threads share a single GTB. A single GTB shared between threads appears multiple times in the address space. Referring to FIG. 24, GTB entries are packed together so that entries in a GTB are consecutive.

Definition

FIG. 24 illustrates a definition for AccessPhysicalGTB. FIG. 25 illustrates the format of a GTB entry:

Entry Format

As shown, each GTB entry is 128 bits.

Field Description

gs = ga + size/2, where 256 ≦ size ≦ 2⁶⁴ and ga, the global address, is aligned to (a multiple of) size.

px = pa ^ ga (exclusive-or); pa, ga, and px are all aligned to (a multiple of) size.

The meanings of the fields are given by the following table:

name   size   meaning
gs     57     global address with size
px     56     physical xor
g      2      minimum privilege required for gateway access
x      2      minimum privilege required for execute access
w      2      minimum privilege required for write access
r      2      minimum privilege required for read access
0      1      reserved
da     1      detail access
so     1      strong ordering
cc     3      cache control

If the entire contents of the GTB entry are zero (0), the entry will not match any global address at all. If a zero value is written, a zero value is read for the GTB entry. Software must not write a zero value for the gs field unless the entire entry is a zero value.

It is an error to write GTB entries that multiply match any global address; all GTB entries must have unique, non-overlapping coverage of the global address space. Hardware may produce a machine check if such overlapping coverage is detected, or may produce any physical address and protection information and continue execution.

Limiting the GTB entry size to 128 bits allows us to replace entries atomically (with a single store operation), which is less complex than the previous design, in which the mask portion was first reduced, then other entries changed, then the mask expanded. However, it limits the amount of attribute information or physical address range we can specify. Consequently, we encode the size as a single additional bit to the global address in order to allow for attribute information.
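The size encoding can be illustrated with a short sketch. It assumes full 64-bit values rather than the packed 57-bit gs and 56-bit px fields, and the function names are hypothetical; it is not the definition of FIG. 26. Because ga is a multiple of size, the lowest set bit of gs = ga + size/2 is exactly size/2, so both ga and size can be recovered from gs alone.

```python
MASK64 = (1 << 64) - 1


def encode_gs(ga, size):
    """Pack an aligned global address and a power-of-two size into gs."""
    assert 256 <= size <= 1 << 64 and size & (size - 1) == 0
    assert ga % size == 0
    return ga + size // 2


def decode_gs(gs):
    """Recover (ga, size) from a nonzero gs value."""
    half = gs & -gs            # lowest set bit of gs is size/2
    return gs - half, half * 2


def gtb_translate(entry_gs, entry_px, addr):
    """Return the physical address for addr, or None if the entry misses."""
    if entry_gs == 0:          # an all-zero entry never matches
        return None
    ga, size = decode_gs(entry_gs)
    if addr & ~(size - 1) & MASK64 != ga:
        return None
    # px = pa ^ ga is aligned to size, so the offset bits pass through.
    return addr ^ entry_px
```

For example, a 4096-byte region at global address 0x10000 mapped to physical address 0x7000 has gs = 0x10800 and px = 0x17000, and global address 0x10ABC translates to 0x7ABC.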

Definition

FIG. 26 illustrates a definition for GlobalAddressTranslation.

GTB Registers

Because the processor contains multiple threads of execution, even when taking virtual memory exceptions, it is possible for two threads to nearly simultaneously invoke software GTB miss exception handlers for the same memory region. In order to avoid producing improper GTB state in such cases, the GTB includes access facilities for indivisibly checking and then updating the contents of the GTB as a result of a memory write to specific addresses.

A 128-bit write to the address GTBUpdateFill (fill=1), as a side effect, causes first a check of the global address specified in the data against the GTB. If the global address check results in a match, the data is directed to write on the matching entry. If there is no match, the address specified by GTBLast is used, and GTBLast is incremented. If incrementing GTBLast results in a zero value, GTBLast is reset to GTBFirst, and GTBBump is set. Note that if the size of the updated value is not equal to the size of the matching entry, the global address check may not adequately ensure that no other entries also cover the address range of the updated value. The operation is unpredictable if multiple entries match the global address.

The GTBUpdateFill register is a 128-bit memory-mapped location, to which a write operation performs the operation defined above. A read operation returns a zero value. The format of the GTBUpdateFill register is identical to that of a GTB entry.

An alternative write address, GTBUpdate (fill=0), updates a matching entry, but makes no change to the GTB if no entry matches. This operation can be used to indivisibly update a GTB entry as to protection or physical address information.
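The check-then-update behavior of GTBUpdateFill and GTBUpdate can be sketched as a small model. This is an illustration under stated assumptions, not the definition of FIG. 27: the class `GTBModel` and its members are hypothetical, each entry is reduced to a (gs, payload) pair, and the global address checked is taken to be the base address recovered from the gs value of the written data.

```python
def _covers(gs, addr):
    """True if the entry described by gs covers global address addr."""
    if gs == 0:
        return False                      # zero entries never match
    half = gs & -gs                       # size/2
    return (addr & ~(2 * half - 1)) == gs - half


class GTBModel:
    def __init__(self, G, first=0):
        self.G = G                        # log2 of the number of entries
        self.entries = [(0, 0)] * (1 << G)
        self.first = first                # GTBFirst
        self.last = first                 # GTBLast
        self.bump = False                 # GTBBump

    def update_write(self, gs, payload, fill):
        """Model a write to GTBUpdateFill (fill=True) or GTBUpdate (fill=False).

        Returns the index written, or None if nothing was changed.
        """
        addr = gs - (gs & -gs)            # base global address in the data
        for i, (egs, _) in enumerate(self.entries):
            if _covers(egs, addr):
                self.entries[i] = (gs, payload)
                return i
        if not fill:
            return None
        i = self.last
        self.entries[i] = (gs, payload)
        self.last = (self.last + 1) % (1 << self.G)   # G-bit register wraps
        if self.last == 0:
            self.last = self.first
            self.bump = True
        return i
```

Because the match check and the write happen inside one call, two handlers that race to fill the same region end up rewriting a single entry rather than creating two overlapping ones.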

Definition

FIG. 27 illustrates a definition for GTBUpdateWrite.

Physical Address

There may be as many GTBs as threads, and up to 2¹¹ registers per GTB (5 registers are implemented). FIG. 28 illustrates the physical address of a GTB control register for thread th, register m, byte b. Note that in FIG. 28, the low-order GT bits of the th value are ignored, reflecting that 2^(GT) threads share single GTB registers. A single set of GTB registers shared between threads appears multiple times in the address space, and manipulates the GTB of the threads with which the registers are associated.

The GTBUpdate register is a 128-bit memory-mapped location, to which a write operation performs the operation defined above. A read operation returns a zero value. The format of the GTBUpdate register is identical to that of a GTB entry. FIG. 29 illustrates the registers GTBLast, GTBFirst, and GTBBump. The registers GTBLast, GTBFirst, and GTBBump are memory mapped. As shown in FIG. 29, the GTBLast and GTBFirst registers are G bits wide, and the GTBBump register is one bit.

Definition

FIG. 30 illustrates a definition for AccessPhysicalGTBRegisters.

Address Generation

The address units of each of the four threads provide up to two global virtual addresses of load, store, or memory instructions, for a total of eight addresses. LTB units associated with each thread translate the local addresses into global addresses. The LZC operates on global addresses. MTB, BTB, and PTB units associated with each thread translate the global addresses into physical addresses and cache addresses. (A PTB unit associated with each thread produces physical addresses and cache addresses for program counter references. This is optional, as by limiting address generation to two per thread, the MTB can be used for program references.) Cache addresses are presented to the LOC as required, and physical addresses are checked against cache tags as required.

Memory Banks

The LZC has two banks, each servicing up to four requests. The LOC has eight banks, each servicing at most one request.

Assuming random request addresses, FIG. 55 shows the expected rate at which requests are serviced by multi-bank/multi-port memories that have 8 total ports and are divided into 1, 2, 4, or 8 interleaved banks. The LZC is 2 banks, each with 4 ports, and the LOC is 8 banks, each 1 port.

Note a small difference between applying 12 references versus 8 references for the LOC (6.5 vs. 5.2), and for the LZC (7.8 vs. 6.9). This suggests that simplifying the system to produce two addresses per thread (program+load/store or two load/store) will not overly hurt performance. A closer simulation, taking into account the sequential nature of the program and load/store traffic, may well yield better numbers, as threads will tend to line up in non-interfering patterns, and program microcaching reduces program fetching.
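One way to approximate such service rates is a simple binomial model, sketched below. The model is an assumption (the text does not say how the figure was computed): each reference independently picks a bank uniformly at random, and a bank with p ports services at most p of the references that land on it. The function name is hypothetical.

```python
from math import comb


def expected_serviced(banks, ports, refs):
    """Expected number of references serviced in one cycle.

    Each of `refs` references selects one of `banks` banks uniformly at
    random; a bank services at most `ports` of the references it receives.
    """
    p = 1.0 / banks
    per_bank = sum(
        min(k, ports) * comb(refs, k) * p**k * (1 - p) ** (refs - k)
        for k in range(refs + 1)
    )
    return banks * per_bank
```

Under this model, 8 references give about 5.25 for the LOC (8 banks, 1 port) and about 6.91 for the LZC (2 banks, 4 ports), and 12 references give about 7.81 for the LZC, all close to the quoted values. For the LOC with 12 references the model gives about 6.4 rather than 6.5, so the figure may rest on slightly different assumptions.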

FIG. 56 shows the rates for both 8 total ports and 16 total ports.

Note significant differences between 8-port systems and 16-port systems, even when used with a maximum of 8 applied references. In particular, a 16-bank 1-port system is better than a 4-bank 2-port system with more than 6 applied references. Current layout estimates would require about a 14% area increase (assuming no savings from smaller/simpler sense amps) to switch to a 16-port LOC, with a 22% increase in 8-reference throughput.

Program Microcache

A program microcache (PMC) which holds only program code for each thread may optionally exist, and does exist for the initial implementation. The program microcache is flushed by reset, or by executing a B.BARRIER instruction. The program microcache is always clean, and is not snooped by writes or otherwise kept coherent, except by flushing as indicated above. The microcache is not altered by writing to the LTB or GTB, and software must execute a B.BARRIER instruction before expecting the new contents of the LTB or GTB to affect determination of PMC hit or miss status on program fetches.

In the initial implementation, the program microcache holds simple loop code. The microcache holds two separately addressed cache lines. Branches or execution beyond this region cause the microcache to be flushed and refilled at the new address, provided that the addresses are executable by the current thread. The program microcache uses the B.HINT and B.HINT.I instructions to accelerate fetching of program code when possible. The program microcache generally functions as a prefetch buffer, except that short forward or backward branches within the region covered maintain the contents of the microcache.

Program fetches into the microcache are requested on any cycle in which fewer than two load/store addresses are generated by the address unit, unless the microcache is already full. System arbitration logic should give program fetches lower priority than load/store references when first presented, then equal priority if the fetch fails arbitration a certain number of times. The delay until program fetches have equal priority should be based on the expected time the program fetch data will be executed; it may be as small as a single cycle, or greater for fetches which are far ahead of the execution point.

Wide Microcache

A wide microcache (WMC) which holds only data fetched for wide (W) instructions may optionally exist, and does exist for the initial implementation, for each unit which implements one or more wide (W) instructions.

The wide (W) instructions each operate on a block of data fetched from memory and the contents of one or more registers, producing a result in a register. Generally, the amount of data in the block exceeds the maximum amount of data that the memory system can supply in a single cycle, so caching the memory data is of particular importance. All the wide (W) instructions require that the memory data be located at an aligned address, an address that is a multiple of the size of the memory data, which is always a power of two.

The wide (W) instructions are performed by functional units which normally perform execute or “back-end” instructions, though the loading of the memory data requires use of the access or “front-end” functional units. To minimize the use of the “front-end” functional units, special rules are used to maintain the coherence of a wide microcache (WMC).

Execution of a wide (W) instruction has a residual effect of loading the specified memory data into a wide microcache (WMC). Under certain conditions, a future wide (W) instruction may be able to reuse the WMC contents.

First of all, any store or cache coherency action on the physical addresses referenced by the WMC will invalidate the contents. The minimum translation unit of the virtual memory system, 256 bytes, defines the number of physical address blocks which must be checked by any store. A WMC for the W.TABLE instruction may be as large as 4096 bytes, and so requires as many as 16 such physical address blocks to be checked for each WMC entry. A WMC for the W.SWITCH or W.MUL.* instructions need check only one address block for each WMC entry, as the maximum size is 128 bytes.

By making these checks on the physical addresses, we do not need to be concerned about changes to the virtual memory mapping from virtual to physical addresses, and the virtual memory state can be freely changed without invalidating any WMC.

Absent any of the above changes, the WMC is only valid if it contains the contents relevant to the current wide (W) instruction. To check this with minimal use of the front-end units, each WMC entry contains a first tag with the thread and address register for which it was last used. If the current wide (W) instruction uses the same thread and address register, it may proceed safely. Any intervening writes to that address register by that thread invalidate the WMC thread and address register tag.

If the above test fails, the front-end is used to fetch the address register and check its contents against a second WMC tag, with the physical addresses for which it was last used. If the tag matches, it may proceed safely. As detailed above, any intervening stores or cache coherency action by any thread to the physical addresses invalidates the WMC entry.

If both the above tests fail for all relevant WMC entries, there is no alternative but to load the data from the virtual memory system into the WMC. The front-end units are responsible for generating the necessary addresses to the virtual memory system to fetch the entire data block into a WMC.
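The two-level tag check described above can be sketched for a single WMC entry. All names here (`WMCEntry`, `wmc_access`, `on_register_write`, `on_store`, and the `translate` and `load` callbacks standing in for the front-end units) are hypothetical, and the sketch assumes a single physical base address per entry for simplicity.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

BLOCK = 256  # minimum translation unit, in bytes


@dataclass
class WMCEntry:
    size: int = 128                             # bytes held by this entry
    reg_tag: Optional[Tuple[int, int]] = None   # (thread, address register)
    phys_tag: Optional[int] = None              # physical base address
    data: Optional[bytes] = None


def wmc_access(entry, thread, reg, translate, load):
    """Return (data, how) for a wide instruction using register `reg`."""
    # First test: same thread and address register as the last use.
    if entry.data is not None and entry.reg_tag == (thread, reg):
        return entry.data, "register-tag hit"
    # Second test: the front-end fetches the register and translates it.
    pa = translate(thread, reg)
    if entry.data is not None and entry.phys_tag == pa:
        entry.reg_tag = (thread, reg)
        return entry.data, "physical-tag hit"
    # Both tests failed: load the whole block through the front-end.
    entry.data = load(pa)
    entry.phys_tag = pa
    entry.reg_tag = (thread, reg)
    return entry.data, "loaded"


def on_register_write(entry, thread, reg):
    """A write to the address register invalidates only the first tag."""
    if entry.reg_tag == (thread, reg):
        entry.reg_tag = None


def on_store(entry, pa):
    """A store to any 256-byte block covered by the entry invalidates it."""
    if entry.phys_tag is None:
        return
    first = entry.phys_tag // BLOCK
    last = (entry.phys_tag + entry.size - 1) // BLOCK
    if first <= pa // BLOCK <= last:
        entry.reg_tag = entry.phys_tag = entry.data = None
```

In this sketch only the first test avoids the front-end entirely; the second still needs the register fetch and translation but skips the memory load.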

For the first implementation, it is anticipated that there will be eight WMC entries for each of the two X units (for W.SWITCH instructions), eight WMC entries for each of the two E units (for W.MUL instructions), and four WMC entries for the single T unit. The total number of WMC address tags required is 8*2*1+8*2*1+4*1*16=96 entries.

The number of WMC address tags can be substantially reduced to 32+4=36 entries by making an implementation restriction requiring that a single translation block be used to translate the data address of W.TABLE instructions. With this restriction, each W.TABLE WMC entry uses a contiguous and aligned physical data memory block, for which a single address tag can contain the relevant information. The size of such a block is a maximum of 4096 bytes. The restriction can be checked by examining the size field of the referenced GTB entry.
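The tag counts above follow from the entry counts and the 256-byte translation unit, as the following sketch shows. The function names are hypothetical; the entry and unit counts are those anticipated for the first implementation.

```python
TRANSLATION_UNIT = 256  # bytes; minimum translation unit


def blocks_to_check(wmc_bytes):
    """Physical address blocks covered by one WMC entry of this size."""
    return max(1, wmc_bytes // TRANSLATION_UNIT)


def wmc_tag_count(single_block_table):
    """Total WMC address tags across the X, E, and T units."""
    switch_tags = 8 * 2 * blocks_to_check(128)   # two X units, W.SWITCH
    mul_tags = 8 * 2 * blocks_to_check(128)      # two E units, W.MUL
    table_blocks = 1 if single_block_table else blocks_to_check(4096)
    table_tags = 4 * 1 * table_blocks            # one T unit, W.TABLE
    return switch_tags + mul_tags + table_tags
```

Here `wmc_tag_count(False)` gives 96 and `wmc_tag_count(True)` gives 36, matching the two totals in the text.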

Level Zero Cache

The innermost cache level, here named the “Level Zero Cache,” (LZC) is fully associative and indexed by global address. Entries in the LZC contain global addresses and previously fetched data from the memory system. The LZC is an implementation feature, not visible to the Zeus architecture.

Entries in the LZC are also used to hold the global addresses of store instructions that have been issued, but not yet completed in the memory system. The LZC entry may also contain the data associated with the global address, as maintained either before or after updating with the store data. When it contains the post-store data, results of stores may be forwarded directly to the requested reference.

With an LZC hit, data is returned from the LZC data, and protection from the LZC tag. No LOC access is required to complete the reference.

All loads and program fetches are checked against the LZC for conflicts with entries being used as store buffer. On a LZC hit on such entries, if the post-store data is present, data may be returned by the LZC to satisfy the load or program fetch. If the post-store data is not present, the load or program fetch must stall until the data is available.

With an LZC miss, a victim entry is selected, and if dirty, the victim entry is written to the LOC. The LOC cache is accessed, and a valid LZC entry is constructed from data from the LOC and tags from the LOC protection information.

All stores are checked against the LZC for conflicts, and further cause a new entry in the LZC, or “take over” a previously clean LZC entry for this purpose. Unaligned stores may require two entries in the LZC. At time of allocation, the address is filled in.

Two operations then occur in parallel: 1) for write-back cached references, the remaining bytes of the hexlet are loaded from the LOC (or LZC), and 2) the addressed bytes are filled in with data from the data path. If an exception causes the store to be purged before retirement, the LZC entry is marked invalid, and not written back. When the store is retired, the LZC entry can be written back to LOC or external interface.

Structure

The eight memory addresses are partitioned into up to four odd addresses, and four even addresses.

The LZC contains 16 fully associative entries that may each contain a single hexlet of data at even hexlet addresses (LZCE), and another 16 entries for odd hexlet addresses (LZCO). The maximum capacity of the LZC is 16*32=512 bytes.

The tags for these entries are indexed by global virtual address (63..5), and contain access control information, detailed below.

The address of entries accessed associatively is also encoded into binary and provided as output from the tags for use in updating the LZC, through its write ports.

Fields: 8-bit rwxg; 16-bit valid; 16-bit dirty; 4-bit L0$ address; 16-bit protection.

def data,protect,valid,dirty,match ← LevelZeroCacheRead(ga) as
  eo ← ga₄
  match ← NONE
  for i ← 0 to LevelZeroCacheEntries/2-1
    if ga_(63..5) = LevelZeroTag[eo][i] then
      match ← i
    endif
  endfor
  if match = NONE then
    raise LevelZeroCacheMiss
  else
    data ← LevelZeroData[eo][match]_(127..0)
    valid ← LevelZeroData[eo][match]_(143..128)
    dirty ← LevelZeroData[eo][match]_(159..144)
    protect ← LevelZeroData[eo][match]_(167..160)
  endif
enddef
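For readers who prefer an executable form, the LevelZeroCacheRead definition above can be transcribed roughly as follows. The Python names and the list-of-lists representation of the tag and data arrays are assumptions made for illustration; each data word is modeled as one integer holding bits 167..0.

```python
class LevelZeroCacheMiss(Exception):
    """Raised when no entry in the selected half matches the address."""


def level_zero_cache_read(ga, tags, data):
    """Look up global address ga in the even/odd halves of the LZC.

    tags[eo][i] holds ga bits 63..5 for entry i of half eo; data[eo][i]
    holds the matching 168-bit word.
    """
    eo = (ga >> 4) & 1          # bit 4 selects the even or odd hexlet half
    key = ga >> 5               # bits 63..5 form the tag
    match = None
    for i, tag in enumerate(tags[eo]):
        if tag == key:
            match = i
    if match is None:
        raise LevelZeroCacheMiss(hex(ga))
    word = data[eo][match]
    return {
        "data": word & ((1 << 128) - 1),      # bits 127..0
        "valid": (word >> 128) & 0xFFFF,      # bits 143..128
        "dirty": (word >> 144) & 0xFFFF,      # bits 159..144
        "protect": (word >> 160) & 0xFF,      # bits 167..160
        "match": match,
    }
```

The loop mirrors the sequential search of the pseudocode; hardware would perform the comparison associatively across all entries at once.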

Level One Cache

The next cache level, here named the “Level One Cache,” (LOC) is four-set-associative and indexed by the physical address. The eight memory addresses are partitioned into up to eight addresses for each of eight independent memory banks. The LOC has a cache block size of 256 bytes, with triclet (32-byte) sub-blocks.

The LOC may be partitioned into two sections, one part used as a cache, and the remainder used as “niche memory.” Niche memory is at least as fast as cache memory, but unlike cache, never misses to main memory. Niche memory may be placed at any virtual address, and has physical addresses fixed in the memory map. The nl field in the control register configures the partitioning of LOC into cache memory and niche memory.

The LOC data memory is (256+8)×4×(128+2) bits, with depth to hold 256 entries in each of four sets, each entry consisting of one hexlet of data (128 bits), one bit of parity, and one spare bit. The additional 8 entries in each of four sets hold the LOC tags, with 128 bits per entry for ⅛ of the total cache, using 512 bytes per data memory and 4K bytes total.

There are 128 cache blocks per set, or 512 cache blocks total. The maximum capacity of the LOC is 128 k bytes. Used as a cache, the LOC is partitioned into 4 sets, each 32 k bytes. Physically, the LOC is partitioned into 8 interleaved physical blocks, each holding 16 k bytes.

The physical address pa_(63..0) is partitioned as below into a 52 to 54 bit tag (three to five bits are duplicated from the following field to accommodate use of a portion of the cache as niche), 8-bit address to the memory bank (7 bits are physical address (pa), 1 bit is virtual address (v)), 3-bit memory bank select (bn), and 4-bit byte address (bt). All accesses to the LOC are in units of 128 bits (hexlets), so the 4-bit byte address (bt) does not apply here. The shaded field (pa,v) is translated via nl to a cache identifier (ci) and set identifier (si) and presented to the LOC as the LOC address to LOC bank bn.
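The field boundaries and capacities given above can be checked with a small sketch. The function name and the dictionary keys `index` and `upper` are hypothetical labels; the bit positions (3..0 for bt, 6..4 for bn, bit 7 for v, 14..8 for the index) follow the field widths in the text.

```python
def split_loc_address(pa):
    """Split a 64-bit physical address into the LOC address fields."""
    return {
        "bt": pa & 0xF,              # bits 3..0: byte within hexlet
        "bn": (pa >> 4) & 0x7,       # bits 6..4: memory bank select
        "v": (pa >> 7) & 0x1,        # bit 7: hexlet-group bit within block
        "index": (pa >> 8) & 0x7F,   # bits 14..8: cache block index
        "upper": pa >> 15,           # bits 63..15: always part of the tag
    }


BLOCK_BYTES = 256
BLOCKS_PER_SET = 128
SETS = 4
BANKS = 8

SET_BYTES = BLOCKS_PER_SET * BLOCK_BYTES   # bytes per set
LOC_BYTES = SETS * SET_BYTES               # total LOC capacity
BANK_BYTES = LOC_BYTES // BANKS            # bytes per interleaved bank
```

These constants evaluate to 32 k bytes per set, 128 k bytes total, and 16 k bytes per bank, in agreement with the capacities stated above.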

The LOC tag consists of 64 bits of information, including a 52 to 54-bit tag and other cache state information. Only one MTB entry at a time may contain a LOC tag.

With 256 byte cache lines, there are 512 cache blocks. At 64 bits per tag, the cache tags require 4 k bytes of storage. This storage is adjacent to the LOC data memory itself, using physical addresses=1024..1055. Alternatively (see detailed description below), physical addresses=0..31 may be used.

The format of a LOC tag entry is shown below.

The meanings of the fields are given by the following table:

name   size   meaning
tag    52     physical address tag
da     1      detail access (or physical address bit 11)
vs     1      victim select (or physical address bit 10)
mesi   2      coherency: modified (3), exclusive (2), shared (1), invalid (0)
tv     8      triclet valid (1) or invalid (0)

To access the LOC, a global address is supplied to the Micro-Tag Buffer (MTB), which associatively looks up the global address into a table holding a subset of the LOC tags. In particular, each MTB table entry contains the cache index derived from physical address bits 14..8, ci, (7 bits) and set identifier, si, (2 bits) required to access the LOC data. Each MTB table entry also contains the protection information of the LOC tag.

With an MTB hit, protection information is supplied from the MTB. The MTB supplies the resulting cache index (ci, from the MTB), set identifier, si, (2 bits) and virtual address (bit 7, v, from the LA), which are applied to the LOC data bank selected from bits 6..4 of the LA. FIG. 115 shows the address presented to LOC data bank bn.

With an MTB miss, the GTB (described below) is referenced to obtain a physical address and protection information.

To select the cache line, a 7-bit niche limit register nl is compared against the value of pa_(14..8) from the GTB. If pa_(14..8)<nl, a 7-bit address modifier register am is inclusive-or'ed against pa_(14..8), producing a cache index, ci. Otherwise, pa_(14..8) is used as ci. Cache lines 0..nl−1, and cache tags 0..nl−1, are available for use as niche memory. Cache lines nl..127 and cache tags nl..127 are used as LOC.

ci ← (pa_(14..8) < nl) ? (pa_(14..8) ∥ am) : pa_(14..8)

The address modifier am is (1^(7-log(128-nl)) ∥ 0^(log(128-nl))). Referring to FIG. 116, the bt field specifies the least-significant bit used for tag, and is (nl<112) ? 12 : 8+log(128−nl).

Values for nl in the range 113..127 require more than 52 physical address tag bits in the LOC tag and a requisite reduction in LOC features. Note that the presence of bits 14..10 of the physical address in the LOC tag is a result of the possibility that, with am=64..127, the cache index value ci cannot be relied upon to supply bits 14..8. Bits 9..8 can be safely inferred from the cache index value ci, so long as nl is in the range 0..124. When nl is in the range 113..127, the da bit is used for bit 11 of the physical address, so the Tag detail access bit is suppressed. When nl is in the range 121..127, the vs bit is used for bit 10 of the physical address, so victim selection is performed without state bits in the LOC tag. When nl is in the range 125..127, the set associativity is decreased, so that si₁ is used for bit 9 of the physical address and when nl is 127, si₀ is used for bit 8 of the physical address.
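The niche-limit computation can be sketched as follows. The function names are hypothetical, and the sketch assumes that "log" means the base-two logarithm rounded down; that reading is consistent with the ranges given above (for example, a limit of 113 then yields a tag low bit of 11, and a limit of 127 yields 8).

```python
def floor_log2(n):
    """Base-two logarithm of a positive integer, rounded down."""
    return n.bit_length() - 1


def address_modifier(nl):
    """am: (7 - log(128 - nl)) one bits followed by log(128 - nl) zero bits."""
    k = floor_log2(128 - nl)
    return ((1 << (7 - k)) - 1) << k


def cache_index(pa_index, nl):
    """ci: physical address bits 14..8, or'ed with am when below the limit."""
    if pa_index < nl:
        return pa_index | address_modifier(nl)
    return pa_index


def tag_low_bit(nl):
    """bt: least-significant physical address bit kept in the LOC tag."""
    return 12 if nl <= 112 else 8 + floor_log2(128 - nl)
```

With a niche limit of 64, am is 64, so a reference whose address bits 14..8 equal 5 is steered to cache line 69, while a reference already at line 100 is left unchanged.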

Four tags are fetched from the LOC tags and compared against the PA to determine which of the four sets contains the data. The four tags are contained in two consecutive banks; they may be simultaneously or independently fetched. FIG. 117 shows the address presented to LOC data bank (ci_(1..0)∥si₁).

Note that the CT architecture description variable is present in the above address. CT describes whether dedicated locations exist in the LOC for tags at the next power-of-two boundary above the LOC data. The niche-mapping mechanism can provide the storage for the LOC tags, so the existence of these dedicated tags is optional: If CT=0, addresses at the beginning of the LOC (0..31 for this implementation) are used for LOC tags, and the nl value should be adjusted accordingly by software.

The LOC address (ci∥si) uniquely identifies the cache location, and this LOC address is associatively checked against all MTB entries on changes to the LOC tags, such as by cache block replacement, bus snooping, or software modification. Any matching MTB entries are flushed, even if the MTB entry specifies a different global address; this permits address aliasing (the use of a physical address with more than one global address).

With an LOC miss, a victim set is selected (LOC victim selection is described below), whose contents, if any sub-block is modified, are written to the external memory. A new LOC entry is constructed with address and protection information from the GTB, and data fetched from external memory.

The table below shows the contents of LOC data memory banks 0..7 for addresses 0..2047:

address  bank 7                       ...  bank 1                       bank 0
0        line 0, hexlet 7, set 0      ...  line 0, hexlet 1, set 0      line 0, hexlet 0, set 0
1        line 0, hexlet 15, set 0     ...  line 0, hexlet 9, set 0      line 0, hexlet 8, set 0
2        line 0, hexlet 7, set 1      ...  line 0, hexlet 1, set 1      line 0, hexlet 0, set 1
3        line 0, hexlet 15, set 1     ...  line 0, hexlet 9, set 1      line 0, hexlet 8, set 1
4        line 0, hexlet 7, set 2      ...  line 0, hexlet 1, set 2      line 0, hexlet 0, set 2
5        line 0, hexlet 15, set 2     ...  line 0, hexlet 9, set 2      line 0, hexlet 8, set 2
6        line 0, hexlet 7, set 3      ...  line 0, hexlet 1, set 3      line 0, hexlet 0, set 3
7        line 0, hexlet 15, set 3     ...  line 0, hexlet 9, set 3      line 0, hexlet 8, set 3
8        line 1, hexlet 7, set 0      ...  line 1, hexlet 1, set 0      line 1, hexlet 0, set 0
9        line 1, hexlet 15, set 0     ...  line 1, hexlet 9, set 0      line 1, hexlet 8, set 0
10       line 1, hexlet 7, set 1      ...  line 1, hexlet 1, set 1      line 1, hexlet 0, set 1
11       line 1, hexlet 15, set 1     ...  line 1, hexlet 9, set 1      line 1, hexlet 8, set 1
12       line 1, hexlet 7, set 2      ...  line 1, hexlet 1, set 2      line 1, hexlet 0, set 2
13       line 1, hexlet 15, set 2     ...  line 1, hexlet 9, set 2      line 1, hexlet 8, set 2
14       line 1, hexlet 7, set 3      ...  line 1, hexlet 1, set 3      line 1, hexlet 0, set 3
15       line 1, hexlet 15, set 3     ...  line 1, hexlet 9, set 3      line 1, hexlet 8, set 3
...      ...                          ...  ...                          ...
1016     line 127, hexlet 7, set 0    ...  line 127, hexlet 1, set 0    line 127, hexlet 0, set 0
1017     line 127, hexlet 15, set 0   ...  line 127, hexlet 9, set 0    line 127, hexlet 8, set 0
1018     line 127, hexlet 7, set 1    ...  line 127, hexlet 1, set 1    line 127, hexlet 0, set 1
1019     line 127, hexlet 15, set 1   ...  line 127, hexlet 9, set 1    line 127, hexlet 8, set 1
1020     line 127, hexlet 7, set 2    ...  line 127, hexlet 1, set 2    line 127, hexlet 0, set 2
1021     line 127, hexlet 15, set 2   ...  line 127, hexlet 9, set 2    line 127, hexlet 8, set 2
1022     line 127, hexlet 7, set 3    ...  line 127, hexlet 1, set 3    line 127, hexlet 0, set 3
1023     line 127, hexlet 15, set 3   ...  line 127, hexlet 9, set 3    line 127, hexlet 8, set 3
1024     tag line 3, sets 3 and 2     ...  tag line 0, sets 3 and 2     tag line 0, sets 1 and 0
1025     tag line 7, sets 3 and 2     ...  tag line 4, sets 3 and 2     tag line 4, sets 1 and 0
...      ...                          ...  ...                          ...
1055     tag line 127, sets 3 and 2   ...  tag line 124, sets 3 and 2   tag line 124, sets 1 and 0
1056     reserved                     ...  reserved                     reserved
...      ...                          ...  ...                          ...
2047     reserved                     ...  reserved                     reserved
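The layout in the table can be summarized by two small mapping functions. They are inferred from the rows shown rather than taken from a stated formula, and the function names are hypothetical.

```python
def data_location(line, set_id, hexlet):
    """(address, bank) of a data hexlet in the LOC data memory.

    line is 0..127, set_id is 0..3, and hexlet is 0..15 within the
    256-byte cache block.
    """
    address = line * 8 + set_id * 2 + hexlet // 8
    bank = hexlet % 8
    return address, bank


def tag_location(line, set_id):
    """(address, bank) of the hexlet holding the tag for (line, set_id).

    Each hexlet holds two 64-bit tags: sets 1 and 0 share one hexlet,
    and sets 3 and 2 share the next.
    """
    address = 1024 + line // 4
    bank = (line % 4) * 2 + set_id // 2
    return address, bank
```

For instance, `data_location(1, 3, 8)` returns `(15, 0)` and `tag_location(127, 2)` returns `(1055, 7)`, matching the corresponding table rows.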

The following table summarizes the state transitions required by the LOC cache:

cc op mesi v bus op c x mesi v w m notes
NC R x x uncached read
NC W x x uncached write
CD R I x uncached read
CD R x 0 uncached read
CD R MES 1 (hit)
CD W I x uncached write
CD W x 0 uncached write
CD W MES 1 uncached write 1
WT/WA R I x triclet read 0 x
WT/WA R I x triclet read 1 0 S 1
WT/WA R I x triclet read 1 1 E 1
WT/WA R MES 0 triclet read 0 x inconsistent KEN#
WT/WA R S 0 triclet read 1 0 1
WT/WA R S 0 triclet read 1 1 1 E->S: extra sharing
WT/WA R E 0 triclet read 1 0 1
WT/WA R E 0 triclet read 1 1 S 1 shared block
WT/WA R M 0 triclet read 1 0 S 1 other subblocks M->I
WT/WA R M 0 triclet read 1 1 1 E->M: extra dirty
WT/WA R MES 1 (hit)
WT W I x uncached write
WT W x 0 uncached write
WT W MES 1 uncached write 1
WA W I x triclet read 0 x 1 throwaway read
WA W I x triclet read 1 0 S 1 1 1
WA W I x triclet read 1 1 M 1 1
WA W MES 0 triclet read 0 x 1 1 inconsistent KEN#
WA W S 0 triclet read 1 0 S 1 1 1
WA W S 0 triclet read 1 1 M 1 1
WA W S 1 write 0 S 1 1
WA W S 1 write 1 S 1 1 E->S: extra sharing
WA W E 0 triclet read 1 0 S 1 1 1
WA W E 0 triclet read 1 1 E 1 1 1
WA W E 1 (hit) x M 1 E->M: extra dirty
WA W M 0 triclet read 1 0 M 1 1 1
WA W M 0 triclet read 1 1 M 1 1
WA W M 1 (hit) x M 1

cc      cache control
op      operation: R = read, W = write
mesi    current mesi state
v       current tv state
bus op  bus operation
c       cachable (triclet) result
x       exclusive result
mesi    new mesi state
v       new tv state
w       cacheable write after read
m       merge store data with cache line data
notes   other notes on transition

Definition

def data,tda ← LevelOneCacheAccess(pa,size,lda,gda,cc,op,wd) as
  // cache index
  am ← 1^(7-log(128-nl)) || 0^(log(128-nl))
  ci ← (pa_(14..8)<nl) ? (pa_(14..8)||am) : pa_(14..8)
  bt ← (nl≦112) ? 12 : 8+log(128-nl)
  // fetch tags for all four sets
  tag10 ← ReadPhysical(0xFFFFFFFF00000000_(63..19)||CT||0⁵||ci||0¹||0⁴,128)
  Tag[0] ← tag10_(63..0)
  Tag[1] ← tag10_(127..64)
  tag32 ← ReadPhysical(0xFFFFFFFF00000000_(63..19)||CT||0⁵||ci||1¹||0⁴,128)
  Tag[2] ← tag32_(63..0)
  Tag[3] ← tag32_(127..64)
  vsc ← (Tag[3]₁₀||Tag[2]₁₀) ^ (Tag[1]₁₀||Tag[0]₁₀)
  // look for matching tag
  si ← MISS
  for i ← 0 to 3
    if (Tag[i]_(63..10) || i_(1..0) || 0⁷)_(63..bt) = pa_(63..bt) then
      si ← i
    endif
  endfor
  // detail access checking on MISS
  if (si = MISS) and (lda ≠ gda) then
    if gda then
      PerformAccessDetail(AccessDetailRequiredByGlobalTB)
    else
      PerformAccessDetail(AccessDetailRequiredByLocalTB)
    endif
  endif
  // if no matching tag or invalid MESI or no sub-block, perform cacheable read/write
  else
    nvsc ← vsc
    tda ← (bt>11) ? Tag[si]₁₁ : 0
    if al then
      sm ← Tag[si]_(7..1+pa_(7..5)) || 1¹ || Tag[si]_(pa_(7..5)-1..0)
    endif
  endif
  // write new data into cache and update victim selection and other tag fields
  if al then
    if op=R then
      mesi ← xen ? E : S
    else
      mesi ← xen ? M : I TODO
    endif
    case bt of
      12:
        Tag[si] ← pa_(63..bt) || tda || Tag[si^2]₁₀ ^ nvsc_(si₀) || mesi || sm
        Tag[si^1]₁₀ ← Tag[si^3]₁₀ ^ nvsc_(1^si₀)
      11:
        Tag[si] ← pa_(63..bt) || Tag[si^2]₁₀ ^ nvsc_(si₀) || mesi || sm
        Tag[si^1]₁₀ ← Tag[si^3]₁₀ ^ nvsc_(1^si₀)
      10:
        Tag[si] ← pa_(63..bt) || mesi || sm
    endcase
    dt ← 1
    nca ← 0xFFFFFFFF00000000_(63..19)||0||ci||si||pa_(7..5)||0⁴
    WritePhysical(nca, 256, data)
  endif
  // retrieve data from cache
  if ~bd then
    nca ← 0xFFFFFFFF00000000_(63..19)||0||ci||si||pa_(7..5)||0⁴
    data ← ReadPhysical(nca, 128)
  endif
  // write data into cache
  if (op=W) and bd and al then
    nca ← 0xFFFFFFFF00000000_(63..19)||0||ci||si||pa_(7..5)||0⁴
    data ← ReadPhysical(nca, 128)
    mdata ← data_(127..8*(size+pa_(3..0))) || wd_(8*(size+pa_(3..0))-1..8*pa_(3..0)) || data_(8*pa_(3..0)..0)
    WritePhysical(nca, 128, mdata)
  endif
  // prefetch into cache
  if al=bd and (cc=PF or cc=LS) then
        nca ← 0xFFFFFFFF00000000_(63..19)||0||ci||si||i_(2..0)||0⁴
        WritePhysical(nca, 256, data)
        Tag[si]_(i) ← 1
        dt ← 1
      else
        af ← 1
      endif
    endif
  endfor
  endif
  // cache tag writeback if dirty
  if dt then
    nt ← Tag[si₁||1¹] || Tag[si₁||0¹]
    WritePhysical(0xFFFFFFFF00000000_(63..19)||CT||0⁵||ci||si₁||0⁴, 128, nt)
  endif
enddef

Physical Address

The LOC data memory banks are accessed implicitly by cached memory accesses to any physical memory location as shown above. The LOC data memory banks are also accessed explicitly by uncached memory accesses to particular physical address ranges. The address mapping of these ranges is designed to facilitate use of a contiguous portion of the LOC cache as niche memory.

The physical address of a LOC hexlet for LOC address ba, bank bn, byte b is shown in FIG. 118.

Within the explicit LOC data range, starting from a physical address, pa_(17..0), FIG. 119 shows the LOC address (pa_(17..7)) presented to LOC data bank (pa_(6..4)).

The table below shows the LOC data memory bank and address referenced by byte address offsets in the explicit LOC data range. Note that this mapping includes the addresses used for LOC tags.

Byte offset  LOC bank and address
0            bank 0, address 0
16           bank 1, address 0
32           bank 2, address 0
48           bank 3, address 0
64           bank 4, address 0
80           bank 5, address 0
96           bank 6, address 0
112          bank 7, address 0
128          bank 0, address 1
144          bank 1, address 1
160          bank 2, address 1
176          bank 3, address 1
192          bank 4, address 1
208          bank 5, address 1
224          bank 6, address 1
240          bank 7, address 1
. . .        . . .
262016       bank 0, address 2047
262032       bank 1, address 2047
262048       bank 2, address 2047
262064       bank 3, address 2047
262080       bank 4, address 2047
262096       bank 5, address 2047
262112       bank 6, address 2047
262128       bank 7, address 2047
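The mapping in the table above reduces to two bit-field extractions from the byte offset. The following is a minimal Python sketch, assuming the field positions given in the text (bank in bits 6..4, LOC address in bits 17..7); the function name `loc_bank_and_address` is illustrative and does not appear in the specification.

```python
def loc_bank_and_address(offset):
    """Split a byte offset in the explicit LOC data range into (bank, address)."""
    bank = (offset >> 4) & 0x7        # pa_(6..4): one of 8 banks
    address = (offset >> 7) & 0x7FF   # pa_(17..7): one of 2048 rows per bank
    return bank, address


print(loc_bank_and_address(0))        # (0, 0)
print(loc_bank_and_address(144))      # (1, 1)
print(loc_bank_and_address(262128))   # (7, 2047)
```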

Definition

def data ← AccessPhysicalLOC(pa,op,wd) as
  bank ← pa_(6..4)
  addr ← pa_(17..7)
  case op of
    R:
      rd ← LOCArray[bank][addr]
      crc ← LOCRedundancy[bank]
      data ← (crc and rd_(130..2)) or (~crc and rd_(128..0))
      p[0] ← 0
      for i ← 0 to 128 by 1
        p[i+1] ← p[i] ^ data_(i)
      endfor
      if ControlRegister₆₁ and (p[129] ≠ 1) then
        raise CacheError
      endif
    W:
      p[0] ← 0
      for i ← 0 to 127 by 1
        p[i+1] ← p[i] ^ wd_(i)
      endfor
      wd₁₂₈ ← ~p[128]
      crc ← LOCRedundancy[bank]
      rdata ← (crc_(126..0) and wd_(126..0)) or (~crc_(126..0) and wd_(128..2))
      LOCArray[bank][addr] ← wd_(128..127) || rdata || wd_(1..0)
  endcase
enddef
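The parity handling in the definition above can be illustrated in isolation. This is a sketch only, assuming Python integers stand in for the 128-bit data word and the 129-bit word with parity in bit 128; the helper names `parity_bit` and `parity_ok` are hypothetical.

```python
def parity_bit(data128):
    """Bit 128 on a write: the complement of the XOR of data bits 0..127."""
    ones = bin(data128 & ((1 << 128) - 1)).count("1")
    return (ones & 1) ^ 1


def parity_ok(word129):
    """Read-side check: the XOR over all 129 bits must be 1 (odd parity)."""
    ones = bin(word129 & ((1 << 129) - 1)).count("1")
    return (ones & 1) == 1


data = 0b1011                                  # three bits set
word = (parity_bit(data) << 128) | data
print(parity_ok(word))                         # True
print(parity_ok(word ^ 1))                     # False: a single-bit error is detected
```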

Level One Cache Stress Control

LOC cells may be fabricated with marginal parameters, for which changes in clock timing or power supply voltage may cause these LOC cells to fail or pass. When testing the LOC while the part is in a normal circuit environment, rather than a special test environment with changeable power supply levels, cells with marginal parameters may not reliably fail testing.

To combat this problem, two bits of the control register, LOC stress, may be set to stress the circuit environment while testing. Under normal operation, these bits are cleared (00), while during stress testing, one or more of these bits are set (01, 10, 11). Self-testing should be performed in each of the environment settings, and the detected failures combined together to produce a reliable test for cells with marginal parameters.

Level One Cache Redundancy

The LOC contains facilities that can be used to avoid minor defects in the LOC data array.

Each LOC bank has three additional bits of data storage for each 128 bits of memory data (for a total of 131 bits). One of these bits is used to retain odd parity over the 128 bits of memory data, and the other two bits are spare, which can be pressed into service by setting a non-zero value in the LOC redundancy control register for that bank.

Each row of a LOC bank contains 131 bits: 128 bits of memory data, one bit for parity, and two spare bits:

LOC redundancy control has 129 bits:

Each bit set in the control word causes the corresponding data bit to be selected from a bit address increased by two:

output←(data and ˜control) or ((spare₀∥p∥data_(127..2)) and control)

parity←(p and ˜pc) or (spare₁ and pc)

The LOC redundancy control register has 129 bits, but is written with a 128-bit value. To set the pc bit in the LOC redundancy control, a value is written to the control with either bit 124 set (1) or bit 126 set (1). To set bit 124 of the LOC redundancy control, a value is written to the control with both bit 124 set (1) and bit 126 set (1). When the LOC redundancy control register is read, the process is reversed by selecting the pc bit instead of control bit 124 for the value of bit 124 if control bit 126 is zero (0).

This system can remove one defective column at an even bit position and one defective column at an odd bit position within each LOC block. For each defective column location, x, the LOC control bits must be set at bits x, x+2, x+4, x+6, . . . . If the defective column is in the parity location (bit 128), then set bit 124 only. The following table defines the control bits for parity, bit 126 and bit 124 (other control bits are the same as the values written):

value₁₂₆  value₁₂₄  pc  control₁₂₆  control₁₂₄
0         0         0   0           0
0         1         1   0           0
1         0         1   1           0
1         1         1   1           1
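The column-replacement rule described above (control bits set at x, x+2, x+4, . . .) can be checked against the write and read selections of the AccessPhysicalLOC definition. The Python sketch below is an illustration under stated assumptions: integers model the 131-bit row and the internal 129-bit control word, the defective column is modeled by flipping one row bit, and all function names are hypothetical.

```python
MASK127 = (1 << 127) - 1
MASK129 = (1 << 129) - 1


def control_for_defect(x):
    """Internal 129-bit control word with bits x, x+2, x+4, ... set."""
    control = 0
    for k in range(x, 129, 2):
        control |= 1 << k
    return control


def write_row(wd, crc):
    """Spread a 129-bit word (data plus parity) over a 131-bit row."""
    rdata = (crc & wd & MASK127) | (~crc & (wd >> 2) & MASK127)
    return ((wd >> 127) << 129) | (rdata << 2) | (wd & 0b11)


def read_row(rd, crc):
    """Recover the 129-bit word; flagged bits come from two columns higher."""
    return ((crc & (rd >> 2)) | (~crc & rd)) & MASK129


word = (1 << 128) | 0x0123456789ABCDEF0123456789ABCDEF
for x in (0, 5, 127, 128):
    crc = control_for_defect(x)
    row = write_row(word, crc) ^ (1 << x)      # corrupt the defective column
    print(x, read_row(row, crc) == word)
```

With the control word set for column x, the corrupted row bit is never selected on a read, so the stored word is recovered intact in each case printed above.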

Physical Address

The LOC redundancy controls are accessed explicitly by uncached memory accesses to particular physical address ranges.

The physical address of a LOC redundancy control for LOC bank bn, byte b is:

Definition

def data ← AccessPhysicalLOCRedundancy(pa,op,wd) as
  bank ← pa_(6..4)
  case op of
    R:
      rd ← LOCRedundancy[bank]
      data ← rd_(127..125)||(rd₁₂₆ ? rd₁₂₄ : rd₁₂₈)||rd_(123..0)
    W:
      rd ← (wd₁₂₆ or wd₁₂₄)||wd_(127..125)||(wd₁₂₆ and wd₁₂₄)||wd_(123..0)
      LOCRedundancy[bank] ← rd
  endcase
enddef
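The packing of the 129-bit control word into a 128-bit register value can be exercised directly. The following Python sketch mirrors the R and W cases of the definition above, assuming integers for the register values and placing pc at internal bit 128; `write_control` and `read_control` are illustrative names.

```python
MASK128 = (1 << 128) - 1
BIT124 = 1 << 124


def write_control(value128):
    """Expand a written 128-bit value into the internal 129-bit control word."""
    b126 = (value128 >> 126) & 1
    b124 = (value128 >> 124) & 1
    pc = b126 | b124                              # internal bit 128
    control = value128 & MASK128 & ~BIT124
    control |= (b126 & b124) << 124               # internal bit 124
    return (pc << 128) | control


def read_control(control129):
    """Reverse the mapping: bit 124 reads back as pc when bit 126 is clear."""
    c126 = (control129 >> 126) & 1
    c124 = (control129 >> 124) & 1
    pc = (control129 >> 128) & 1
    b124 = c124 if c126 else pc
    return (control129 & MASK128 & ~BIT124) | (b124 << 124)


for b126 in (0, 1):
    for b124 in (0, 1):
        value = (b126 << 126) | (b124 << 124)
        stored = write_control(value)
        print(b126, b124, stored >> 128, (stored >> 126) & 1, (stored >> 124) & 1)
```

The printed rows follow the column order of the table above (value₁₂₆, value₁₂₄, pc, control₁₂₆, control₁₂₄), and reading back any stored word returns the value that was written.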

Memory Attributes

Fields in the LTB, GTB and cache tag control various attributes of the memory access in the specified region of memory. These include the control of cache consultation, updating, allocation, prefetching, coherence, ordering, victim selection, detail access, and cache prefetching.

Cache Control

The cache may be used in one of five ways, depending on a three-bit cache control field (cc) in the LTB and GTB. The cache control field may be set to one of seven states: NC, CD, WT, WA, PF, SS, and LS:

                       read               write              read/write
State           cc     consult  allocate  update  allocate   victim  prefetch
No Cache        0      No       No        No      No         No      No
Cache Disable   1      Yes      No        Yes     No         No      No
Write Through   2      Yes      Yes       Yes     No         No      No
reserved        3
Write Allocate  4      Yes      Yes       Yes     Yes        No      No
PreFetch        5      Yes      Yes       Yes     Yes        No      Yes
SubStream       6      Yes      Yes       Yes     Yes        Yes     No
LineStream      7      Yes      Yes       Yes     Yes        Yes     Yes

The Zeus processor controls cc as an attribute in the LTB and GTB, thus software may set this attribute for certain address ranges and clear it for others. The maximum of the three-bit cache control field (cc) values of the LTB and GTB indicates the choice of caching, according to the table above.
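The selection rule above (take the larger of the LTB and GTB cc values, then look up the behavior) can be sketched as a small table lookup. This Python sketch is illustrative only: the enum, the attribute tuple layout, and the name `effective_cc` are assumptions made for the example, and the reserved value 3 is simply rejected.

```python
from enum import IntEnum


class CacheControl(IntEnum):
    NC = 0  # No Cache
    CD = 1  # Cache Disable
    WT = 2  # Write Through
    WA = 4  # Write Allocate
    PF = 5  # PreFetch
    SS = 6  # SubStream
    LS = 7  # LineStream


# (read consult, read allocate, write update, write allocate, victim, prefetch)
ATTRIBUTES = {
    CacheControl.NC: (False, False, False, False, False, False),
    CacheControl.CD: (True, False, True, False, False, False),
    CacheControl.WT: (True, True, True, False, False, False),
    CacheControl.WA: (True, True, True, True, False, False),
    CacheControl.PF: (True, True, True, True, False, True),
    CacheControl.SS: (True, True, True, True, True, False),
    CacheControl.LS: (True, True, True, True, True, True),
}


def effective_cc(ltb_cc, gtb_cc):
    """Caching choice for an access: the maximum of the LTB and GTB cc fields."""
    return CacheControl(max(ltb_cc, gtb_cc))   # raises ValueError for reserved value 3


cc = effective_cc(CacheControl.WT, CacheControl.PF)
print(cc.name, ATTRIBUTES[cc])
```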

No Cache

No Cache (NC) is an attribute that can be set on a LTB or GTB translation region to indicate that the cache is not to be consulted. No changes to the cache state result from reads or writes with this attribute set (except for accesses that directly address the cache via a memory-mapped region).

Cache Disable

Cache Disable (CD) is an attribute that can be set on a LTB or GTB translation region to indicate that the cache is to be consulted and updated for cache lines which are already present, but no new cache lines or sub-blocks are to be allocated when the cache does not already contain the addressed memory contents.

The “Socket 7” bus also provides a mechanism for supporting chip sets to decide on each access whether data is to be cached, using the CACHE# and KEN# signals. Using these signals, external hardware may cause a region selected as WT, WA or PF to be treated as CD. This mechanism is only active on the first such access to a memory region if caching is enabled, as the cache may satisfy subsequent references without a bus transaction.

Write Through

Write Through (WT) is an attribute that can be set on a LTB or GTB translation region to indicate that writes to the cache must also immediately update backing memory. Reads to addressed memory that is not present in the cache cause cache lines or sub-blocks to be allocated. Writes to addressed memory that is not present in the cache do not modify cache state.

The “Socket 7” bus also provides a mechanism for supporting chip sets to decide on each access whether data is to be written through, using the PWT and WB/WT# signals. Using these signals, external hardware may cause a region selected as WA or PF to be treated as WT. This mechanism is only active on the first write to each region of memory; on subsequent references, if the cache line is in the Exclusive or Modified state and writeback caching is enabled on the first reference, no subsequent bus operation occurs, at least until the cache line is flushed.

Write Allocate

Write allocate (WA) is an attribute that can be set on a LTB or GTB translation region to indicate that the processor is to allocate a memory block to the cache when the data is not previously present in the cache and the operation to be performed is a store. Reads to addressed memory that is not present in the cache cause cache lines or sub-blocks to be allocated. For cacheable data, write allocate is generally the preferred policy, as allocating the data to the cache reduces further bus traffic for subsequent references (loads or stores) of the data. Write allocate never occurs for data which is not cached. A write allocate brings the data immediately into the Modified state.

Other “Socket 7” processors have the ability to inhibit write allocate to cached locations under certain conditions, related by the address range. K6, for example, can inhibit write allocate in the range of 15-16 Mbyte, or for all addresses above a configurable limit with 4 Mbyte granularity. Pentium has the ability to label address ranges over which write allocate can be inhibited.

PreFetch

Prefetch (PF) is an attribute that can be set on a LTB or GTB translation region to indicate that increased prefetching is appropriate for references in this region. Each program fetch, load or store to a cache line that does not already contain all the sub-blocks causes a prefetch allocation of the remaining sub-blocks. Cache misses cause allocation of the requested sub-block and prefetch allocation of the remaining sub-blocks. Prefetching does not necessarily fill in the entire cache line, as prefetch memory references are performed at a lower priority than other cache and memory reference traffic. A limited number of prefetches (as low as one in the initial implementation) can be queued; the older prefetch requests are terminated as new ones are created.

In other respects, the PF attribute is handled in the manner of the WA attribute. Prefetching is considered an implementation-dependent feature, and an implementation may choose to implement a region with the PF attribute exactly as with the WA attribute.

Implementations may perform even more aggressive prefetching in future versions. Data may be prefetched into the cache in regions that are cacheable, as a result of program fetches, loads or stores to nearby addresses. Prefetches may extend beyond the cache line associated with the nearby address. Prefetches shall not occur beyond the reach of the GTB entry associated with the nearby address. Prefetching is terminated if an attempted cache fill results in a bus response that is not cacheable. Prefetches are implementation-dependent behavior, and such behavior may vary as a result of other memory references or other bus activity.

SubStream

SubStream (SS) is an attribute that can be set on a LTB or GTB translation region to indicate that references in this region are to be selected as the next victim on a cache miss. In particular, cache misses, which normally place the cache line in the last-to-be-victim state, instead place the cache line in the first-to-be-victim state, except relative to cache lines in the I state.

In other respects, the SS attribute is handled in the manner of the WA attribute. SubStream is considered an implementation-dependent feature, and an implementation may choose to implement a region with the SS attribute exactly as with the WA attribute.

The SubStream attribute is appropriate for regions which are large data structures in which the processor is likely to reference the memory data just once or a small number of times, but for which the cache permits the data to be fetched using burst transfers. By making it a priority for victimization, these references are less likely to interfere with caching of data for which the cache performs a longer-term storage function.

LineStream

LineStream (LS) is an attribute that can be set on a LTB or GTB translation region to indicate that references in this region are to be selected as the next victim on a cache miss, and to enable prefetching. In particular, cache misses, which normally place the cache line in the last-to-be-victim state, instead place the cache line in the first-to-be-victim state, except relative to cache lines in the I state.

In other respects, the LS attribute is handled in the manner of the PF attribute. LineStream is considered an implementation-dependent feature, and an implementation may choose to implement a region with the LS attribute exactly as with the PF or WA attributes.

Like the SubStream attribute, the LineStream attribute is particularly appropriate for regions for which large data structures are used in sequential fashion. By prefetching the entire cache line, memory traffic is performed as large sequential bursts of at least 256 bytes, maximizing the available bus utilization.

Cache Coherence

Cache coherency is maintained by using MESI protocols, in which, for each cache line (256 bytes), the cache data is kept in one of four states: M, E, S, I:

State         this Cache data                                            other Cache data                      Memory data
Modified   3  Data is held exclusively in this cache.                    No data is present in other caches.   The contents of main memory are now invalid.
Exclusive  2  Data is held exclusively in this cache.                    No data is present in other caches.   Data is the same as the contents of main memory.
Shared     1  Data is held in this cache, and possibly others.           Data is possibly in other caches.     Data is the same as the contents of main memory.
Invalid    0  No data for this location is present in the cache.         Data is possibly in other caches.     Data is possibly present in main memory.

The state is contained in the mesi field of the cache tag.

In addition, because the “Socket 7” bus performs block transfers and cache coherency actions on triclet (32 byte) blocks, each cache line also maintains 8 bits of triclet valid (tv) state. Each bit of tv corresponds to a triclet sub-block of the cache line; bit 0 for bytes 0..31, bit 1 for bytes 32..63, bit 2 for bytes 64..95, etc. If the tv bit is zero (0), the coherence state for that triclet is I, no matter what the value of the mesi field. If the tv bit is one (1), the coherence state is defined by the mesi field. If all the tv bits are cleared (0), the mesi field must also be cleared, indicating an invalid cache line.
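The combination of the line-wide mesi field and the per-triclet tv bits can be sketched as follows. This is a minimal Python illustration, assuming the state encoding from the table above (I=0, S=1, E=2, M=3); the function name `triclet_state` is hypothetical.

```python
def triclet_state(mesi, tv, byte_offset):
    """Coherence state of the triclet holding byte_offset (0..255) of a cache line."""
    triclet = (byte_offset >> 5) & 0x7       # 32-byte sub-block index, 0..7
    if ((tv >> triclet) & 1) == 0:
        return "I"                           # tv bit clear: invalid regardless of mesi
    return "ISEM"[mesi]                      # tv bit set: the mesi field applies


print(triclet_state(3, 0b00000010, 40))      # M: bytes 32..63 are valid
print(triclet_state(3, 0b00000010, 0))       # I: bytes 0..31 are not valid
```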

Cache coherency activity generally follows the protocols defined by the “Socket 7” bus, as defined by Pentium and K6-2 documentation. However, because the coherence state of a cache line is represented in only 10 bits per 256 bytes (1.25 bits per triclet), a few state transitions are defined differently. The differences are a direct result of attempts to set triclets within a cache line to different MES states that cannot be represented. The data structure allows any triclet to be changed to the I state, so state transitions in this direction match the Pentium processor exactly.

On the Pentium processor, for a cache line in the M state, an external bus Inquiry cycle that does not require invalidation (INV=0) places the cache line in the S state. On the Zeus processor, if no other triclet in the cache line is valid, the mesi field is changed to S. If other triclets in the cache line are valid, the mesi field is left unchanged, and the tv bit for this triclet is turned off, effectively changing it to the I state.

On the Pentium processor, for a cache line in the E state, an external bus Inquiry cycle that does not require invalidation (INV=0) places the cache line in the S state. On the Zeus processor, the mesi field is changed to S. If other triclets in the cache line are valid, the MESI state is effectively changed to the S state for these other triclets.
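The two Zeus inquiry-cycle rules just described (INV=0 on a line in the M or E state) can be written as a small state update. The Python sketch below is a hedged illustration: it assumes the inquiry hits a valid triclet, uses the encoding I=0, S=1, E=2, M=3, and leaves the S and I cases unchanged because the paragraphs above do not describe them; `inquiry_no_invalidate` is a hypothetical name.

```python
I, S, E, M = 0, 1, 2, 3


def inquiry_no_invalidate(mesi, tv, triclet):
    """Return (mesi, tv) after an inquiry cycle with INV=0 that hits one triclet."""
    this_bit = 1 << triclet
    others_valid = (tv & ~this_bit & 0xFF) != 0
    if mesi == M:
        if others_valid:
            return M, tv & ~this_bit & 0xFF   # only this triclet drops to I
        return S, tv                          # sole valid triclet: line becomes S
    if mesi == E:
        return S, tv                          # every valid triclet becomes S
    return mesi, tv                           # assumption: S and I left unchanged


print(inquiry_no_invalidate(M, 0b00000001, 0))   # (1, 1)
print(inquiry_no_invalidate(M, 0b00000011, 0))   # (3, 2)
print(inquiry_no_invalidate(E, 0b00000011, 0))   # (1, 3)
```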

On the Pentium processor, for a cache line in the S state, an internal store operation causes a write-through cycle and a transition to the E state. On the Zeus processor, the mesi field is changed to E. Other triclets in the cache line are invalidated by clearing the tv bits; the MESI state is effectively changed to the I state for these other triclets.

When allocating data into the cache due to a store operation, data is brought immediately into the Modified state, setting the mesi field to M. If the previous mesi field is S, other triclets which are valid are invalidated by clearing the tv bits. If the previous mesi field is E, other triclets are kept valid and therefore changed to the M state.

When allocating data into the cache due to a load operation, data is brought into the Shared state if another processor reports that the data is present in its cache or the mesi field is already set to S; into the Exclusive state if no processor reports that the data is present in its cache and the mesi field is currently E or I; or into the Modified state if the mesi field is already set to M. The determination is performed by driving PWT low and checking whether WB/WT# is sampled high; if so, the line is brought into the Exclusive state. (See page 202 (184) of the K6-2 documentation.)

Strong Ordering

Strong ordering (so) is an attribute which permits certain memory regions to be operated with strong ordering, in which all memory operations are performed exactly in the order specified by the program, and others to be operated with weak ordering, in which some memory operations may be performed out of program order.

The Zeus processor controls strong ordering as an attribute in the LTB and GTB, thus software may set this attribute for certain address ranges and clear it for others. A one bit field indicates the choice of access ordering. A one (1) bit indicates strong ordering, while a zero (0) bit indicates weak ordering.

With weak ordering, the memory system may retain store operations in a store buffer indefinitely for later storage into the memory system, or until a synchronization operation to any address performed by the thread that issued the store operation forces the store to occur. Load operations may be performed in any order, subject to requirements that they be performed logically subsequent to prior store operations to the same address, and subsequent to prior synchronization operations to any address. Under weak ordering it is permitted to forward results from a retained store operation to a future load operation to the same address. Operations are considered to be to the same address when any bytes of the operation are in common. Weak ordering is usually appropriate for conventional memory regions, which are side-effect free.

With strong ordering, the memory system must perform load and store operations in the order specified. In particular, strong-ordered load operations are performed in the order specified, and all load operations (whether weak or strong) must be delayed until all previous strong-ordered store operations have been performed, which can have a significant performance impact. Strong ordering is often required for memory-mapped I/O regions, where store operations may have a side-effect on the value returned by loads to other addresses. Note that Zeus has memory-mapped I/O, such as the TB, for which the use of strong ordering is essential to proper operation of the virtual memory system.

The EWBE# signal in “Socket 7” is of importance in maintaining strong ordering. When a write is performed with the signal inactive, no further writes to E or M state lines may occur until the signal becomes active. Further details are given in Pentium documentation (K6-2 documentation may not apply to this signal).

Victim Selection

One bit of the cache tag, the vs bit, controls the selection of which set of the four sets at a cache address should next be chosen as a victim for cache line replacement. Victim selection (vs) is an attribute associated with LOC cache blocks. No vs bits are present in the LTB or GTB.

There are two hexlets of tag information for a cache line, and replacement of a set requires writing only one hexlet. To update priority information for victim selection by writing only one hexlet, information in each hexlet is combined by an exclusive-or. It is the nature of the exclusive-or function that altering either of the two hexlets can change the priority information.

Full Victim Selection Ordering for Four Sets

There are 4*3*2*1=24 possible orderings of the four sets, which can be completely encoded in as few as 5 bits: 2 bits to indicate highest priority, 2 bits for second-highest priority, 1 bit for third-highest priority, and 0 bits for lowest priority. Dividing this up per set and duplicating per hexlet with the exclusive-or scheme above requires three bits per set, which suggests simply keeping track of the three-highest priority sets with 2 bits each, using 6 bits total and three bits per set.

Specifically, vs bits from the four sets are combined to produce a 6-bit value:

vsc←(vs[3]∥vs[2])^(vs[1]∥vs[0])

The highest priority for replacement is set vsc_(1..0), second highest priority is set vsc_(3..2), third highest priority is set vsc_(5..4), and lowest priority is vsc_(5..4)^vsc_(3..2)^vsc_(1..0). When the highest priority set is replaced, it becomes the new lowest priority and the others are moved up, computing a new vsc by:

vsc←vsc_(5..4)^vsc_(3..2)^vsc_(1..0)∥vsc_(5..2)

When replacing set vsc for a LineStream or SubStream replacement, the priority for replacement is unchanged, unless another set contains the invalid MESI state, computing a new vsc by:

vsc←(mesi[vsc_(5..4)^vsc_(3..2)^vsc_(1..0)]=I)?vsc_(5..4)^vsc_(3..2)^vsc_(1..0)∥vsc_(5..2):
        (mesi[vsc_(5..4)]=I)?vsc_(1..0)∥vsc_(5..2):
        (mesi[vsc_(3..2)]=I)?vsc_(5..4)∥vsc_(3..2):
        vsc

Cache flushing and invalidations can cause cache lines to be cleared out of sequential order. Flushing or invalidating a cache line moves that set to highest priority. If that set is already highest priority, the vsc is unchanged. If the set was second or third highest or lowest priority, the vsc is changed to move that set to highest priority, moving the others down.

vsc←((fs=vsc_(1..0) or fs=vsc_(3..2))?vsc_(5..4):vsc_(3..2))∥(fs=vsc_(1..0)?vsc_(3..2):vsc_(1..0))∥fs

When updating the hexlet containing vs[1] and vs[0], the new values of vs[1] and vs[0] are:

vs[1]←vs[3]^vsc_(5..3)
vs[0]←vs[2]^vsc_(2..0)

When updating the hexlet containing vs[3] and vs[2], the new values of vs[3] and vs[2] are:

vs[3]←vs[1]^vsc_(5..3)
vs[2]←vs[0]^vsc_(2..0)

Software must initialize the vs bits to a legal, consistent state. For example, to set the priority (highest to lowest) to (0, 1, 2, 3), vsc must be set to 0b100100. There are many legal solutions that yield this vsc value, such as vs[3]←0, vs[2]←0, vs[1]←4, vs[0]←4.
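The full four-set ordering scheme can be traced with a short Python sketch. It is an illustration under assumptions: vs is a list indexed by set number holding 3-bit values, the flush update follows the reading in which a flushed set moves to highest priority and the remaining sets keep their relative order, and all function names are hypothetical.

```python
def combine(vs):
    """vsc <- (vs[3] || vs[2]) ^ (vs[1] || vs[0]), each vs[i] being 3 bits."""
    return ((vs[3] << 3) | vs[2]) ^ ((vs[1] << 3) | vs[0])


def ordering(vsc):
    """Sets from highest to lowest replacement priority."""
    first, second, third = vsc & 3, (vsc >> 2) & 3, (vsc >> 4) & 3
    return first, second, third, first ^ second ^ third


def after_replacement(vsc):
    """The replaced (highest-priority) set becomes lowest; the others move up."""
    lowest = (vsc ^ (vsc >> 2) ^ (vsc >> 4)) & 3
    return (lowest << 4) | (vsc >> 2)


def after_flush(vsc, fs):
    """The flushed or invalidated set fs moves to highest priority."""
    first, second, third = vsc & 3, (vsc >> 2) & 3, (vsc >> 4) & 3
    high = third if fs in (first, second) else second
    middle = second if fs == first else first
    return (high << 4) | (middle << 2) | fs


def update_low_hexlet(vs, vsc):
    """Rewrite only vs[1] and vs[0] so that combine(vs) yields the new vsc."""
    new = list(vs)
    new[1] = vs[3] ^ (vsc >> 3)
    new[0] = vs[2] ^ (vsc & 7)
    return new


vs = [4, 4, 0, 0]                       # vs[0]=4, vs[1]=4, vs[2]=0, vs[3]=0
vsc = combine(vs)
print(bin(vsc), ordering(vsc))          # 0b100100 (0, 1, 2, 3)
vsc = after_replacement(vsc)
print(ordering(vsc))                    # (1, 2, 3, 0)
print(ordering(combine(update_low_hexlet(vs, vsc))))   # (1, 2, 3, 0)
```

Because the two hexlets are combined by exclusive-or, writing only the hexlet that holds vs[1] and vs[0] is enough to install the new ordering, as the last line shows.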

Simplified Victim Selection Ordering for Four Sets

However, the orderings are simplified in the first Zeus implementation, to reduce the number of vs bits to one per set, keeping a two bit vsc state value:

vsc←(vs[3]∥vs[2])^(vs[1]∥vs[0])

The highest priority for replacement is set vsc, second highest priority is set vsc+1, third highest priority is set vsc+2, and lowest priority is vsc+3. When the highest priority set is replaced, it becomes the new lowest priority and the others are moved up. Priority is given to sets with invalid MESI state, computing a new vsc by:

vsc←(mesi[vsc+1]=I)?vsc+1:
        (mesi[vsc+2]=I)?vsc+2:
        (mesi[vsc+3]=I)?vsc+3:
        vsc+1

When replacing set vsc for a LineStream or SubStream replacement, the priority for replacement is unchanged, unless another set contains the invalid MESI state, computing a new vsc by:

vsc←(mesi[vsc+1]=I)?vsc+1:
        (mesi[vsc+2]=I)?vsc+2:
        (mesi[vsc+3]=I)?vsc+3:
        vsc

Cache flushing and invalidations can cause cache sets to be cleared out of sequential order. If the current highest priority for replacement is a valid set, the flushed or invalidated set is made highest priority for replacement.

vsc←(mesi[vsc]=I)?vsc:fs

When updating the hexlet containing vs[1] and vs[0], the new values of vs[1] and vs[0] are:

vs[1]←vs[3]^vsc₁
vs[0]←vs[2]^vsc₀

When updating the hexlet containing vs[3] and vs[2], the new values of vs[3] and vs[2] are:

vs[3]←vs[1]^vsc₁
vs[2]←vs[0]^vsc₀

Software must initialize the vs bits, but any state is legal. For example, to set the priority (highest to lowest) to (0, 1, 2, 3), vsc must be set to 0b00. There are many legal solutions that yield this vsc value, such as vs[3]←0, vs[2]←0, vs[1]←0, vs[0]←0.
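The simplified scheme reduces to a two-bit rotating pointer that skips ahead to an invalid set when one exists. The Python sketch below illustrates the three update rules above; it assumes that the additions vsc+1, vsc+2, vsc+3 wrap modulo 4, that `mesi` is a list of the four sets' states with I encoded as 0, and that the function names are hypothetical.

```python
INVALID = 0  # encoding of the I state


def next_victim(vsc, mesi, streaming=False):
    """New vsc after replacing set vsc; streaming selects the LineStream/SubStream rule."""
    for step in (1, 2, 3):
        candidate = (vsc + step) & 3
        if mesi[candidate] == INVALID:
            return candidate                 # prefer a set holding no valid data
    return vsc if streaming else (vsc + 1) & 3


def victim_after_flush(vsc, mesi, fs):
    """Flushed set fs becomes the next victim unless the current one is already invalid."""
    return vsc if mesi[vsc] == INVALID else fs


print(next_victim(0, [3, 3, 3, 3]))                   # 1
print(next_victim(0, [3, 3, 0, 3]))                   # 2
print(next_victim(0, [3, 3, 3, 3], streaming=True))   # 0
print(victim_after_flush(1, [3, 3, 3, 0], 3))         # 3
```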

Full Victim Selection Ordering for Additional Sets

To extend the full-victim-ordering scheme to eight sets, 3*7=21 bits are needed, which divided among two tags is 11 bits per tag. This is somewhat generous, as the minimum required is 8*7*6*5*4*3*2*1=40320 orderings, which can be represented in as few as 16 bits. Extending the full-victim-ordering four-set scheme above to represent the first 4 priorities in binary, but to use 2 bits for each of the next 3 priorities requires 3+3+3+3+2+2+2=18 bits. Representing fewer distinct orderings can further reduce the number of bits used. As an extreme example, using the simplified scheme above with eight sets requires only 3 bits, which divided among two tags is 2 bits per tag.
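The bit counts quoted above follow from counting orderings. A brief Python check, using only the standard library (the helper name `bits_for_orderings` is illustrative):

```python
import math


def bits_for_orderings(sets):
    """Minimum number of bits that can distinguish every ordering of `sets` sets."""
    return (math.factorial(sets) - 1).bit_length()


print(math.factorial(4), bits_for_orderings(4))   # 24 5
print(math.factorial(8), bits_for_orderings(8))   # 40320 16
print(3 * 7, 3 + 3 + 3 + 3 + 2 + 2 + 2)           # 21 18
```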

Victim Selection without LOC Tag Bits

At extreme values of the niche limit register (nl in the range 121..124), the bit normally used to hold the vs bit is usurped for use as a physical address bit. Under these conditions, no vsc value is maintained per cache line; instead, a single, global vsc value is used to select victims for cache replacement. In this case, the cache consists of four lines, each with four sets. On each replacement a new si value is computed from:

gvsc←gvsc+1
si←gvsc^pa_(11..10)

The algorithm above is designed to utilize all four sets on sequential access to memory.
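The global selection rule can be sketched in a few lines of Python. This is an illustration only, assuming gvsc is a two-bit counter that wraps modulo 4 (the text does not state its width) and using the hypothetical name `select_set`.

```python
def select_set(gvsc, pa):
    """Advance the global counter and choose a set from it and pa_(11..10)."""
    gvsc = (gvsc + 1) & 3                  # assumed two-bit wraparound
    si = gvsc ^ ((pa >> 10) & 3)           # exclusive-or with physical address bits 11..10
    return gvsc, si


gvsc = 3
for pa in (0x000, 0x400, 0x800, 0xC00):
    gvsc, si = select_set(gvsc, pa)
    print(hex(pa), gvsc, si)
```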

Victim Selection Encoding LOC Tag Bits

At even more extreme values of the niche limit register (nl in the range 125..127), not only is the bit normally used to hold the vs bit usurped for use as a physical address bit, but there is a deficit of one or two physical address bits. In this case, the number of sets can be reduced to encode physical address bits into the victim selection, allowing the choice of set to indicate physical address bits 9 or bits 9..8. On each replacement a new vsc value is computed from:

gvsc←gvsc+1
si←pa₉∥((nl=127)?pa₈:gvsc^pa₁₀)

The algorithm above is designed to utilize all four sets on sequential access to memory.

Detail Access

Detail access is an attribute which can be set on a cache block ortranslation region to indicate that software needs to be consulted oneach potential access, to determine whether the access should proceed ornot. Setting this attribute causes an exception trap to occur, by whichsoftware can examine the virtual address, by for example, locating datain a table, and if indicated, causes the processor to continueexecution. In continuing, ephemeral state is set upon returning to there-execution of the instruction that prevents the exception trap fromrecurring on this particular re-execution only. The ephemeral state iscleared as soon as the instruction is either completed or subject toanother exception, so DetailAccess exceptions can recur on a subsequentexecution of the same instruction. Alternatively, if the access is notto proceed, execution has been trapped to software at this point, whichcan abort the thread or take other corrective action.

The detail access attribute permits specification of access parameters over memory regions on arbitrary byte boundaries. This is important for emulators, which must prevent store access to code which has been translated, and for simulating machines which have byte granularity on segment boundaries. The detail access attribute can also be applied to debuggers, which need to set breakpoints on byte-level data, or which may use the feature to set code breakpoints on instruction boundaries without altering the program code, enabling breakpoints on code contained in ROM.

A one bit field indicates the choice of detail access. A one (1) bit indicates detail access, while a zero (0) bit indicates no detail access. Detail access is an attribute that can be set by the LTB, the GTB, or a cache tag.

The table below indicates the proper status for all potential values of the detail access bits in the LTB, GTB, and Tag:

    LTB   GTB    Tag    status
    0     0      0      OK - normal
    0     0      1      AccessDetailRequiredByTag
    0     1      0      AccessDetailRequiredByGTB
    0     1      1      OK - GTB inhibited by Tag
    1     0      0      AccessDetailRequiredByLTB
    1     0      1      OK - LTB inhibited by Tag
    1     1      0      OK - LTB inhibited by GTB
    1     1      1      AccessDetailRequiredByTag
    0     Miss          GTBMiss
    1     Miss          AccessDetailRequiredByLTB
    0     0      Miss   Cache Miss
    0     1      Miss   AccessDetailRequiredByGTB
    1     0      Miss   AccessDetailRequiredByLTB
    1     1      Miss   Cache Miss

The first eight rows show appropriate activities when all three bits are available. The detail access attributes for the LTB, GTB, and cache tag work together to define whether and which kind of detail access exception trap occurs. Generally, setting a single attribute bit causes an exception, while setting two bits inhibits such exceptions. In this way, a detail access exception can be narrowed down to cause an exception over a specified region of memory: software generally will set the cache tag detail access bit only for regions in which the LTB or GTB also has a detail access bit set. Because cache activity may flush and refill cache lines implicitly, it is not generally useful to set the cache tag detail access bit alone, but if this occurs, the AccessDetailRequiredByTag exception catches such an attempt.

The next two rows show appropriate activities on a GTB miss. On a GTB miss, the detail access bit in the GTB is not present. If the LTB indicates detail access and the GTB misses, the AccessDetailRequiredByLTB exception should be indicated. If software continues from the AccessDetailRequiredByLTB exception and has not filled in the GTB, the GTBMiss exception happens next. Since the GTBMiss exception is not a continuation exception, a re-execution after the GTBMiss exception can cause a reoccurrence of the AccessDetailRequiredByLTB exception. Alternatively, if software continues from the AccessDetailRequiredByLTB exception and has filled in the GTB, the AccessDetailRequiredByLTB exception is inhibited for that reference, no matter what the status of the GTB and Tag detail bits, but the re-executed instruction is still subject to the AccessDetailRequiredByGTB and AccessDetailRequiredByTag exceptions.

The last four rows show appropriate activities for a cache miss. On a cache miss, the detail access bit in the tag is not present. If the LTB or GTB indicates detail access and the cache misses, the AccessDetailRequiredByLTB or AccessDetailRequiredByGTB exception should be indicated. If software continues from these exceptions and has not filled in the cache, a cache miss happens next. If software continues from the AccessDetailRequiredByLTB or AccessDetailRequiredByGTB exception and has filled in the cache, the previous exception is inhibited for that reference, no matter what the status of the Tag detail bit, but is still subject to the AccessDetailRequiredByTag exception. When the detail bit must be created from a cache miss, the initial value filled in is zero. Software may set the bit, thus turning off AccessDetailRequired exceptions per cache line. If the cache line is flushed and refilled, the detail access bit in the cache tag is again reset to zero, and another AccessDetailRequired exception occurs.
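The status table and the three cases discussed above can be condensed into a small decision function. The following Python sketch is only an illustration of the table: the function name `detail_access_status` and the use of `None` to model a miss are assumptions, and the three "OK" rows are collapsed into one return value.

```python
def detail_access_status(ltb, gtb, tag):
    """Return the status for detail access bits ltb, gtb, tag (each 0 or 1).

    gtb=None models a GTB miss; tag=None models a cache miss.
    """
    if gtb is None:
        # GTB miss: only the LTB bit is available.
        return "AccessDetailRequiredByLTB" if ltb else "GTBMiss"
    if tag is None:
        # Cache miss: exactly one of the LTB/GTB bits set raises an exception.
        if ltb ^ gtb:
            return "AccessDetailRequiredByGTB" if gtb else "AccessDetailRequiredByLTB"
        return "Cache Miss"
    # All three bits available: an odd number of set bits raises an exception.
    if ltb ^ gtb ^ tag:
        if tag:
            return "AccessDetailRequiredByTag"
        if gtb:
            return "AccessDetailRequiredByGTB"
        return "AccessDetailRequiredByLTB"
    return "OK"
```

The XOR form reflects the rule stated in the text that a single set bit causes an exception while a second set bit inhibits it; the priority order (Tag, then GTB, then LTB) follows the table rows.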

Settings of the niche limit parameter to values that require use of the da bit in the LOC tag for retaining the physical address usurp the capability to set the Tag detail access bit. Under such conditions, the Tag detail access bit is effectively always zero (0), so it cannot inhibit AccessDetailRequiredByLTB, inhibit AccessDetailRequiredByGTB, or cause AccessDetailRequiredByTag.

The execution of a Zeus instruction has a reference to one quadlet of instruction, which may be subject to the DetailAccess exceptions, and a reference to data, which may be unaligned or wide. These unaligned or wide references may cross GTB or cache boundaries, and thus involve multiple separate references that are combined together, each of which may be subject to the DetailAccess exception. There is sufficient information in the DetailAccess exception handler to process unaligned or wide references.

The implementation is free to indicate DetailAccess exceptions for unaligned and wide data references either in combined form, or with each sub-reference separated. For example, in an unaligned reference that crosses a GTB or cache boundary, a DetailAccess exception may be indicated for a portion of the reference. The exception may report the virtual address and size of the complete reference, and upon continuing, may inhibit reoccurrence of the DetailAccess exception for any portion of the reference. Alternatively, it may report the virtual address and size of only a reference portion and inhibit reoccurrence of the DetailAccess exception for only that portion of the reference, subject to another DetailAccess exception occurring for the remaining portion of the reference.

Micro Translation Buffer

The Micro Translation Buffer (MTB) is an implementation-dependent structure which reduces the access traffic to the GTB and the LOC tags. The MTB contains and caches information read from the GTB and LOC tags, and is consulted on each access to the LOC.

To access the LOC, a global address is supplied to the Micro-Translation Buffer (MTB), which associatively looks up the global address into a table holding a subset of the LOC tags. In addition, each table entry contains the physical address bits 14..8 (7 bits) and set identifier (2 bits) required to access the LOC data.

In the first Zeus implementation, there are two MTB blocks: MTB 0 is used for threads 0 and 1, and MTB 1 is used for threads 2 and 3. Per clock cycle, each MTB block can check for 4 simultaneous references to the LOC. Each MTB block has 16 entries.

Each MTB entry consists of a bit less than 128 bits of information, including a 56-bit global address tag, 8 bits of privilege level required for read, write, execute, and gateway access, a detail bit, and 10 bits of cache state indicating, for each triclet (32 bytes) sub-block, the MESI state.

Match

Output

The output of the MTB combines physical address and protection information from the GTB and the referenced cache line.

The meanings of the fields are given by the following table:

    name   size   meaning
    ga     56     global address
    gi     9      GTB index
    ci     7      cache index
    si     2      set index
    vs     12     victim select
    da     1      detail access (from cache line)
    mesi   2      coherency: modified (3), exclusive (2), shared (1), invalid (0)
    tv     8      triclet valid (1) or invalid (0)
    g      2      minimum privilege required for gateway access
    x      2      minimum privilege required for execute access
    w      2      minimum privilege required for write access
    r      2      minimum privilege required for read access
    0      1      reserved
    da     1      detail access (from GTB)
    so     1      strong ordering
    cc     3      cache control

With an MTB hit, the resulting cache index (14..8 from the MTB, bit 7 from the LA) and set identifier (2 bits from the MTB) are applied to the LOC data bank selected from bits 6..4 of the GVA. The access protection information (pr and rwxg) is supplied from the MTB.

With an MTB (and BTB) miss, a victim entry is selected for replacement. The MTB and BTB are always clean, so the victim entry is discarded without a writeback. The GTB (described below) is referenced to obtain a physical address and protection information. Depending on the access information in the GTB, either the MTB or BTB is filled.

Note that the processing of the physical address pa_(14..8) against the niche limit n1 can be performed on the physical address from the GTB, producing the LOC address, ci. The LOC address, after processing against n1, is placed into the MTB directly, reducing the latency of an MTB hit.

Four tags are fetched from the LOC tags and compared against the PA to determine which of the four sets contains the data. If one of the four sets contains the correct physical address, a victim MTB entry is selected for replacement, the MTB is filled and the LOC access proceeds. If none of the four sets is a hit, an LOC miss occurs.

[Figure labels: MTB miss, GTB cam, LOC tag, MTB fill; MTB victim; LOC miss]

The operation of the MTB is largely not visible to software; hardware mechanisms are responsible for automatically initializing, filling and flushing the MTB. Activity that modifies the GTB or LOC tag state may require that one or more MTB entries are flushed.

A write to the GTBUpdate register that updates a matching entry, a write to the GTBUpdateFill register, or a direct write to the GTB all flush relevant entries from the MTB. MTB flushing is accomplished by searching MTB entries for values that match on the gi field with the GTB entry that has been modified. Each such matching MTB entry is flushed.
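The flush-by-gi search described above can be illustrated with a short Python sketch. The `MtbEntry` class, its `valid` flag, and the function name `flush_mtb_for_gtb_entry` are hypothetical and exist only for this illustration; the text does not describe how hardware marks an entry as flushed.

```python
from dataclasses import dataclass


@dataclass
class MtbEntry:
    ga: int             # global address tag
    gi: int             # index of the GTB entry this MTB entry was filled from
    valid: bool = True  # hypothetical flag: False once the entry is flushed


def flush_mtb_for_gtb_entry(mtb, modified_gi):
    """Flush every MTB entry whose gi field matches the modified GTB entry.

    Returns the number of entries flushed.
    """
    flushed = 0
    for entry in mtb:
        if entry.valid and entry.gi == modified_gi:
            entry.valid = False
            flushed += 1
    return flushed


mtb = [MtbEntry(ga=0x1000, gi=3), MtbEntry(ga=0x2000, gi=5), MtbEntry(ga=0x3000, gi=3)]
count = flush_mtb_for_gtb_entry(mtb, 3)
```

Matching on gi rather than on the address lets a single GTB modification invalidate every cached translation derived from that GTB entry, regardless of which global addresses they cover.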

The MTB is kept synchronous with the LOC tags, particularly with respect to MESI state. On an LOC miss or LOC snoop, any changes in MESI state update (or flush) MTB entries which physically match the address. The MTB may contain less than the full physical address; it is sufficient to retain the LOC physical address (ci∥v∥si).

Block Translation Buffer

Zeus has a per thread “Block Translation Buffer” (BTB). The BTB retains GTB information for uncached address blocks. The BTB is used in parallel with the MTB; exactly one of the BTB or MTB may translate a particular reference. When both the BTB and MTB miss, the GTB is consulted, and depending on the result, the block is filled into either the MTB or BTB as appropriate. In the first Zeus implementation, the BTB has 2 entries for each thread.

BTB entries cover any power-of-two granularity, as they retain the size information from the GTB. BTB entries contain no MESI state, as they only contain uncached blocks.

Each BTB entry consists of 128 bits of information, containing the same information in the same format as a GTB entry.

Niche blocks are indicated by GTB information, and correspond to blocks of data that are retained in the LOC and never miss. A special physical address range indicates niche blocks. For this address range, the BTB enables use of the LOC as a niche memory, generating the “set select” address bits from low-order address bits. There is no checking of the LOC tags for consistent use of the LOC as a niche; the n1 field must be preset by software so that LOC cache replacement never claims the LOC niche space, and only BTB miss and protection bits prevent software from using the cache portion of the LOC as niche.

Other address ranges include other on-chip resources, such as bus interface registers, the control register and status register, as well as off-chip memory, accessed through the bus interface. Each of these regions is accessible as uncached memory.

Program Translation Buffer

Later implementations of Zeus may optionally have a per thread “Program Translation Buffer” (PTB). The PTB retains GTB and LOC cache tag information. The PTB enables generation of LOC instruction fetching in parallel with load/store fetching. The PTB is updated when instruction fetching crosses a cache line boundary (each 64 instructions in straight-line code). The PTB functions similarly to a one-entry MTB, but can use the sequential nature of program code fetching to avoid checking the 56-bit match. The PTB is flushed at the same time as the MTB.

The initial implementation of Zeus has no PTB; the MTB suffices for this function.

Global Virtual Cache

The initial implementation of Zeus contains cache which is both indexed and tagged by a physical address. Other prototype implementations have used a global virtual address to index and/or tag an internal cache. This section will define the required characteristics of a global virtually-indexed cache. TODO

Memory Interface

Dedicated hardware mechanisms are provided to fetch data blocks in the level zero and one caches, provided that a matching entry can be found in the MTB or GTB (or if the MMU is disabled). Dedicated hardware mechanisms are provided to store back data blocks in the level zero and one caches, regardless of the state of the MTB and GTB. When no entry is to be found in the GTB, an exception handler is invoked either to generate the required information from the virtual address, or to place an entry in the GTB to provide for automatic handling of this and other similarly addressed data blocks.

The initial implementation of Zeus accesses the remainder of the memory system through the “Socket 7” interface. Via this interface, Zeus accesses a secondary cache, DRAM memory, external ROM memory, and an I/O system. The size and presence of the secondary cache and the DRAM memory array, and the contents of the external ROM memory and the I/O system are variables in the processor environment.

Microarchitecture

Each thread has two address generation units, capable of producing two aligned, or one unaligned load or store operation per cycle. Alternatively, these units may produce a single load or store address and a branch target address.

Each thread has an LTB, which translates the two addresses into global virtual addresses.

Each pair of threads has an MTB, which looks up the four references into the LOC. The PTB provides for additional references that are program code fetches.

In parallel with the MTB, these four references are combined with the four references from the other thread pair and partitioned into even and odd hexlet references. Up to four references are selected for each of the even and odd portions of the LZC. One reference for each of the eight banks of the LOC (four are even hexlets; four are odd hexlets) is selected from the eight load/store/branch references and the PTB references.

Some references may be directed to both the LZC and LOC, in which case the LZC hit causes the LOC data to be ignored. An LZC miss which hits in the MTB is filled from the LOC to the LZC. An LZC miss which misses in the MTB causes a GTB access and LOC tag access, then an MTB fill and LOC access, then an LZC fill.

Priority of access: (highest/lowest) cache dump, cache fill, load, program, store.

Snoop

The “Socket 7” bus requires certain bus accesses to be checked against on-chip caches. On a bus read, the address is checked against the on-chip caches, with accesses aborted when requested data is in an internal cache in the M state; when the data is in the E state, the internal cache is changed to the S state. On a bus write, data written must update data in on-chip caches. To meet these requirements, physical bus addresses must be checked against the LOC tags.

The S7 bus requires that responses to inquire cycles occur with fixed timing. At least with certain combinations of bus and processor clock rate, inquire cycles will require top priority to meet the inquire response timing requirement.

Synchronization operations must take into account bus activity. Generally, a synchronization operation can only proceed on cached data which is in the Exclusive or Modified state; if cached data is in the Shared state, ownership must be obtained. Data that is not cached must be accessed using locked bus cycles.

Load

Load operations require partitioning into reads that do not cross a hexlet (128 bit) boundary, checking for store conflicts, checking the LZC, checking the LOC, and reading from memory. Execute and Gateway accesses are always aligned and, since they are smaller than a hexlet, do not cross a hexlet boundary.

Note: S7 processors perform unaligned operations LSB first, MSB last, up to 64 bits at a time. Unaligned 128 bit loads need 3 64-bit operations, LSB, octlet, MSB. Transfers which are smaller than a hexlet but larger than an octlet are further divided in the S7 bus unit.

Definition

    def data ← LoadMemoryX(ba,la,size,order)
      assert (order = L) and ((la and (size/8-1)) = 0) and (size = 32)
      hdata ← TranslateAndCacheAccess(ba,la,size,X,0)
      data ← hdata_(31+8*(la and 15)..8*(la and 15))
    enddef

    def data ← LoadMemoryG(ba,la,size,order)
      assert (order = L) and ((la and (size/8-1)) = 0) and (size = 64)
      hdata ← TranslateAndCacheAccess(ba,la,size,G,0)
      data ← hdata_(63+8*(la and 15)..8*(la and 15))
    enddef

    def data ← LoadMemory(ba,la,size,order)
      if (size > 128) then
        data0 ← LoadMemory(ba, la, size/2, order)
        data1 ← LoadMemory(ba, la+(size/2), size/2, order)
        case order of
          L:
            data ← data1 || data0
          B:
            data ← data0 || data1
        endcase
      else
        bs ← 8*la_(4..0)
        be ← bs + size
        if be > 128 then
          data0 ← LoadMemory(ba, la, 128 − bs, order)
          data1 ← LoadMemory(ba, (la_(63..5) + 1) || 0⁴, be − 128, order)
          case order of
            L:
              data ← (data1 || data0)
            B:
              data ← (data0 || data1)
          endcase
        else
          hdata ← TranslateAndCacheAccess(ba,la,size,R,0)
          for i ← 0 to size-8 by 8
            j ← bs + ((order=L) ? i : size-8-i)
            data_(i+7..i) ← hdata_(j+7..j)
          endfor
        endif
      endif
    enddef
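The innermost loop of LoadMemory, which copies bytes out of a hexlet in little-endian or big-endian order, can be sketched in Python as below. This is an illustration under stated assumptions: the function name `extract_from_hexlet` is hypothetical, the hexlet is modeled as a 128-bit Python integer, and the byte offset is taken from the low four address bits (as in LoadMemoryX and LoadMemoryG) so that it stays within one hexlet, whereas the LoadMemory definition writes la_(4..0).

```python
def extract_from_hexlet(hdata, la, size, order):
    """Extract `size` bits (a multiple of 8) starting at byte address `la`.

    hdata: the containing hexlet as a 128-bit integer, byte k at bits 8k+7..8k
    order: "L" for little-endian, "B" for big-endian
    """
    bs = 8 * (la & 0xF)  # assumption: bit offset within the 16-byte hexlet
    if bs + size > 128:
        raise ValueError("reference crosses a hexlet boundary; split it first")
    data = 0
    for i in range(0, size, 8):
        # j selects the source byte: ascending for L, descending for B
        j = bs + (i if order == "L" else size - 8 - i)
        byte = (hdata >> j) & 0xFF      # hdata_(j+7..j)
        data |= byte << i               # data_(i+7..i) <- hdata_(j+7..j)
    return data


# A hexlet whose byte k holds the value k.
hexlet = int.from_bytes(bytes(range(16)), "little")
little = extract_from_hexlet(hexlet, 4, 32, "L")
big = extract_from_hexlet(hexlet, 4, 32, "B")
```

With this test hexlet, the little-endian read places the lowest-addressed byte in the least significant position, and the big-endian read reverses the byte order of the same four bytes.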

Store

Store operations require partitioning into stores less than 128 bits that do not cross hexlet boundaries, checking for store conflicts, checking the LZC, checking the LOC, and storing into memory.

    def StoreMemory(ba,la,size,order,data)
      bs ← 8*la_(4..0)
      be ← bs + size
      if be > 128 then
        case order of
          L:
            data0 ← data_(127-bs..0)
            data1 ← data_(size-1..128-bs)
          B:
            data0 ← data_(size-1..be-128)
            data1 ← data_(be-129..0)
        endcase
        StoreMemory(ba, la, 128 - bs, order, data0)
        StoreMemory(ba, (la_(63..5) + 1) || 0⁴, be - 128, order, data1)
      else
        for i ← 0 to size-8 by 8
          j ← bs + ((order=L) ? i : size-8-i)
          hdata_(j+7..j) ← data_(i+7..i)
        endfor
        xdata ← TranslateAndCacheAccess(ba, la, size, W, hdata)
      endif
    enddef

Memory

Memory operations require first translating via the LTB and GTB, checking for access exceptions, then accessing the cache.

Definition

    def hdata ← TranslateAndCacheAccess(ba,la,size,rwxg,hwdata)
      if ControlRegister₆₂ then
        case rwxg of
          R:
            at ← 0
          W:
            at ← 1
          X:
            at ← 2
          G:
            at ← 3
        endcase
        rw ← (rwxg=W) ? W : R
        ga,LocalProtect ← LocalTranslation(th,ba,la,pl)
        if LocalProtect_(9+2*at..8+2*at) < pl then
          raise AccessDisallowedByLTB
        endif
        lda ← LocalProtect₄
        pa,GlobalProtect ← GlobalTranslation(th,ga,pl,lda)
        if GlobalProtect_(9+2*at..8+2*at) < pl then
          raise AccessDisallowedByGTB
        endif
        cc ← (LocalProtect_(2..0) > GlobalProtect_(2..0)) ? LocalProtect_(2..0) : GlobalProtect_(2..0)
        so ← LocalProtect₃ or GlobalProtect₃
        gda ← GlobalProtect₄
        hdata,TagProtect ← LevelOneCacheAccess(pa,size,lda,gda,cc,rw,hwdata)
        if (lda ^ gda ^ TagProtect) = 1 then
          if TagProtect then
            PerformAccessDetail(AccessDetailRequiredByTag)
          elseif gda then
            PerformAccessDetail(AccessDetailRequiredByGlobalTB)
          else
            PerformAccessDetail(AccessDetailRequiredByLocalTB)
          endif
        endif
      else
        case rwxg of
          R, X, G:
            hdata ← ReadPhysical(la,size)
          W:
            WritePhysical(la,size,hwdata)
        endcase
      endif
    enddef
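The bit-field handling in TranslateAndCacheAccess can be made concrete with a short Python sketch. It only mirrors the field positions and comparisons as written in the definition above; the function names are hypothetical, and the protection words are modeled as plain integers.

```python
ACCESS_TYPE = {"R": 0, "W": 1, "X": 2, "G": 3}  # the `at` value for each access kind


def privilege_field(protect, rwxg):
    """Return the 2-bit field protect_(9+2*at..8+2*at) for the given access kind."""
    at = ACCESS_TYPE[rwxg]
    return (protect >> (8 + 2 * at)) & 0b11


def access_disallowed(protect, rwxg, pl):
    """Mirror the comparison in the definition: disallowed when field < pl."""
    return privilege_field(protect, rwxg) < pl


def combine_controls(local_protect, global_protect):
    """Return (cc, so): the larger of the two cache control fields (bits 2..0)
    and the OR of the two strong ordering bits (bit 3)."""
    cc = max(local_protect & 0b111, global_protect & 0b111)
    so = ((local_protect >> 3) | (global_protect >> 3)) & 1
    return cc, so


# Read field = 2 (bits 9..8), write field = 1 (bits 11..10).
protect = (2 << 8) | (1 << 10)
```

Taking the maximum of the two cc fields and the OR of the two so bits means the more restrictive of the local and global settings always wins, so neither translation level can relax a constraint imposed by the other.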

Rounding and Exceptions

In accordance with one embodiment of the invention, rounding is specified within the instructions explicitly, to avoid explicit state registers for a rounding mode. Similarly, the instructions explicitly specify how standard exceptions (invalid operation, division by zero, overflow, underflow and inexact) are to be handled (U.S. Pat. No. 5,812,439 describes this “Technique of incorporating floating point information into processor instructions.”).

In this embodiment, when no rounding is explicitly named by the instruction (default), round to nearest rounding is performed, and all floating-point exception signals cause the standard-specified default result, rather than a trap. When rounding is explicitly named by the instruction (N: nearest, Z: zero, F: floor, C: ceiling), the specified rounding is performed, and floating-point exception signals other than inexact cause a floating-point exception trap. When X (exact, or exception) is specified, all floating-point exception signals cause a floating-point exception trap, including inexact. More details regarding rounding and exceptions are described in the “Rounding and Exceptions” section.
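The relationship between the rounding control named in an instruction and the exception signals that trap can be summarized in a few lines of Python. This is a sketch of the rule stated above only; the function name `trapping_exceptions`, the exception name strings, and the use of `None` for the default control are assumptions made for the illustration.

```python
ALL_EXCEPTIONS = frozenset(
    {"invalid", "division_by_zero", "overflow", "underflow", "inexact"}
)


def trapping_exceptions(control=None):
    """Return the set of floating-point exception signals that cause a trap.

    control: None (default), one of "N", "Z", "F", "C", or "X".
    """
    if control is None:
        # Default: round to nearest, every signal yields the default result.
        return frozenset()
    if control in ("N", "Z", "F", "C"):
        # Explicit rounding: everything except inexact traps.
        return ALL_EXCEPTIONS - {"inexact"}
    if control == "X":
        # Exact: every signal traps, including inexact.
        return ALL_EXCEPTIONS
    raise ValueError("unknown rounding control: %r" % (control,))
```

Encoding this choice in the instruction itself, rather than in a mode register, is what lets the trapping behavior of each instruction be known at decode time.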

This technique assists the Zeus processor in executing floating-point operations with greater parallelism. When default rounding and exception handling control is specified in floating-point instructions, Zeus may safely retire instructions following them, as they are guaranteed not to cause data-dependent exceptions. Similarly, floating-point instructions with N, Z, F, or C control can be guaranteed not to cause data-dependent exceptions once the operands have been examined to rule out invalid operations, division by zero, overflow or underflow exceptions. Only floating-point instructions with X control, or with N, Z, F, or C control when exceptions cannot be ruled out, need to avoid retiring following instructions until the final result is generated.

ANSI/IEEE standard 754-1985 specifies information to be given to trap handlers for the five floating-point exceptions. The Zeus architecture produces a precise exception (the program counter points to the instruction that caused the exception and all register state is present), from which all the required information can be produced in software, as all source operand values and the specified operation are available.

ANSI/IEEE standard 754-1985 specifies a set of five “sticky-exception” bits, for recording the occurrence of exceptions that are handled by default. The Zeus architecture produces a precise exception for instructions with N, Z, F, or C control for invalid operation, division by zero, overflow or underflow exceptions and with X control for all floating-point exceptions, from which corresponding sticky-exception bits can be set. Execution of the same instruction with default control will compute the default result with round-to-nearest rounding. Most compound operations not specified by the standard are not available with rounding and exception controls.

Instruction Set

This section describes the instruction set in complete architectural detail. Operation codes are numerically defined by their position in the following operation code tables, and are referred to symbolically in the detailed instruction definitions. Entries that span more than one location in the table define the operation code identifier as the smallest value of all the locations spanned. The value of the symbol can be calculated from the sum of the legend values to the left and above the identifier.

Instructions that have great similarity and identical formats are grouped together. Starting on a new page, each category of instructions is named and introduced.

The Operation codes section lists each instruction by mnemonic that is defined on that page. A textual interpretation of each instruction is shown beside each mnemonic.

The Equivalences section lists additional instructions known to assemblers that are equivalent or special cases of base instructions, again with a textual interpretation of each instruction beside each mnemonic. Below the list, each equivalent instruction is defined, either in terms of a base instruction or another equivalent instruction. The symbol between the instruction and the definition has a particular meaning. If it is an arrow (← or →), it connects two mathematically equivalent operations, and the arrow direction indicates which form is preferred and produced in a reverse assembly. If the symbol is a (

) the form on the left is assembled into the form on the right solely for encoding purposes, and the form on the right is otherwise illegal in the assembler. The parameters in these definitions are formal; the names are solely for pattern-matching purposes, even though they may be suggestive of a particular meaning.

The Redundancies section lists instructions and operand values that may also be performed by other instructions in the instruction set. The symbol connecting the two forms is a (

), which indicates that the two forms are mathematically equivalent, both are legal, but the assembler does not transform one into the other.

The Selection section lists instructions and equivalences together in a tabular form that highlights the structure of the instruction mnemonics.

The Format section lists: (1) the assembler format, (2) the C intrinsics format, (3) the bit-level instruction format, and (4) a definition of bit-level instruction format fields that are not a one-for-one match with named fields in the assembler format.

The Definition section gives a precise definition of each basic instruction.

The Exceptions section lists exceptions that may be caused by the execution of the instructions in this category.

Major Operation Codes

All instructions are 32 bits in size, and use the high order 8 bits to specify a major operation code.
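As an illustration of this layout, the major operation code can be pulled out of a 32-bit instruction word with a shift and a mask. The Python sketch below also extracts the lowest-order six bits, which this section later describes as the minor operation code for certain major codes; the function names are hypothetical.

```python
def major_opcode(inst):
    """Return the high order 8 bits of a 32-bit instruction word."""
    return (inst >> 24) & 0xFF


def minor_opcode(inst):
    """Return the lowest-order six bits, used as the minor operation code
    only for those major codes that define one."""
    return inst & 0x3F


# A word with 31 in the major field and 7 in the minor field (arbitrary middle bits).
word = (31 << 24) | (0x1234 << 6) | 7
```

The fields in between the major and minor codes are instruction-dependent and are not decoded here.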

The major field is filled with a value specified by the following table (blank table entries cause the Reserved Instruction exception to occur):

major operation code field values

    MAJOR 0 32 64 96 128 160 192 224
    0 ARES BEF16 LI16L SI16L XDEPOSIT EMULXI WMULMATXIL
    1 AADDI BEF32 LI16B SI16B GADDI EMULXIU WMULMATXIB
    2 AADDI.O BEF64 LI16AL SI16AL GADDI.O EMULXIM WMULMATXIUL
    3 AADDIU.O BEF128 LI16AB SI16AB GADDIU.O EMULXIC WMULMATXIUB
    4 BLGF16 LI32L SI32L XDEPOSITU EMULADDXI WMULMATXIML
    5 ASUBI BLGF32 LI32B SI32B GSUBI EMULADDXIU WMULMATXIMB
    6 ASUBI.O BLGF64 LI32AL SI32AL GSUBI.O EMULADDXIM WMULMATXICL
    7 ASUBIU.O BLGF128 LI32AB SI32AB GSUBIU.O EMULADDXIC WMULMATXICB
    8 ASETEI BLF16 LI64L SI64L GSETEI XWITHDRAW ECONXIL
    9 ASETNEI BLF32 LI64B SI64B GSETNEI ECONXIB
    10 ASETANDEI BLF64 LI64AL SI64AL GSETANDEI ECONXIUL
    11 ASETANDNEI BLF128 LI64AB SI64AB GSETANDNEI ECONXIUB
    12 ASETLI BGEF16 LI128L SI128L GSETLI XWITHDRAWU ECONXIML
    13 ASETGEI BGEF32 LI128B SI128B GSETGEI ECONXIMB
    14 ASETLIU BGEF64 LI128AL SI128AL GSETLIU ECONXICL
    15 ASETGEIU BGEF128 LI128AB SI128AB GSETGEIU ECONXICB
    16 AANDI BE LIU16L SASI64AL GANDI XDEPOSITM ESCALADDF16 WMULMATXL
    17 ANANDI BNE LIU16B SASI64AB GNANDI ESCALADDF32 WMULMATXB
    18 AORI BANDE LIU16AL SCSI64AL GORI ESCALADDF64 WMULMATGL
    19 ANORI BANDNE LIU16AB SCSI64AB GNORI ESCALADDX WMULMATGB
    20 AXORI BL LIU32L SMSI64AL GXORI XSWIZZLE EMULG8
    21 AMUX BGE LIU32B SMSI64AB GMUX EMULG64
    22 BLU LIU32AL SMUXI64AL GBOOLEAN EMULX
    23 BGEU LIU32AB SMUXI64AB EEXTRACT
    24 ACOPYI BVF32 LIU64L GCOPYI XEXTRACT EEXTRACTI
    25 BNVF32 LIU64B XSELECT8 EEXTRACTIU
    26 BIF32 LIU64AL WTABLEL
    27 BNIF32 LIU64AB G8 E.8 WTABLEB
    28 BI LI8 SI8 G16 XSHUFFLE E.16 WSWITCHL
    29 BLINKI LIU8 G32 XSHIFTI E.32 WSWITCHB
    30 BHINTI G64 XSHIFT E.64 WMINORL
    31 AMINOR BMINOR LMINOR SMINOR G128 E.128 WMINORB

Minor Operation Codes

For the major operation field values A.MINOR, B.MINOR, L.MINOR, S.MINOR, G.8, G.16, G.32, G.64, G.128, XSHIFTI, XSHIFT, E.8, E.16, E.32, E.64, E.128, W.MINOR.L and W.MINOR.B, the lowest-order six bits in the instruction specify a minor operation code:

The minor field is filled with a value from one of the following tables:

minor operation code field values for A.MINOR

    A.MINOR 0 8 16 24 32 40 48 56
    0 AAND ASETE ASETEF ASHLI ASHLIADD
    1 AADD AXOR ASETNE ASETLGF
    2 AADDO AOR ASETANDE ASETLF ASHLIO
    3 AADDUO AANDN ASETANDNE ASETGEF ASHLIUO
    4 AORN ASETL/LZ ASETEF.X ASHLISUB
    5 ASUB AXNOR ASETGE/GEZ ASETLGF.X
    6 ASUBO ANOR ASETLU/GZ ASETLF.X ASHRI
    7 ASUBUO ANAND ASETGEU/LEZ ASETGEF.X ASHRIU ACOM

minor operation code field values for B.MINOR

    B.MINOR 0 8 16 24 32 40 48 56
    0 B
    1 BLINK
    2 BHINT
    3 BDOWN
    4 BGATE
    5 BBACK
    6 BHALT
    7 BBARRIER

minor operation code field values for L.MINOR

    L.MINOR 0 8 16 24 32 40 48 56
    0 L16L L64L LU16L LU64L
    1 L16B L64B LU16B LU64B
    2 L16AL L64AL LU16AL LU64AL
    3 L16AB L64AB LU16AB LU64AB
    4 L32L L128L LU32L L8
    5 L32B L128B LU32B LU8
    6 L32AL L128AL LU32AL
    7 L32AB L128AB LU32AB

minor operation code field values for S.MINOR

    S.MINOR 0 8 16 24 32 40 48 56
    0 S16L S64L SAS64AL
    1 S16B S64B SAS64AB
    2 S16AL S64AL SCS64AL SDCS64AL
    3 S16AB S64AB SCS64AB SDCS64AB
    4 S32L S128L SMS64AL S8
    5 S32B S128B SMS64AB
    6 S32AL S128AL SMUX64AL
    7 S32AB S128AB SMUX64AB

minor operation code field values for G.size

    G.size 0 8 16 24 32 40 48 56
    0 GSETE GSETEF GADDHN GSUBHN GSHLIADD GADDL
    1 GADD GSETNE GSETLGF GADDHZ GSUBHZ GADDLU
    2 GADDO GSETANDE GSETLF GADDHF GSUBHF GAAA
    3 GADDUO GSETANDNE GSETGEF GADDHC GSUBHC
    4 GSETL/LZ GSETEF.X GADDHUN GSUBHUN GSHLISUB GSUBL
    5 GSUB GSETGE/GEZ GSETLGF.X GADDHUZ GSUBHUZ GSUBLU
    6 GSUBO GSETLU/GZ GSETLF.X GADDHUF GSUBHUF GASA
    7 GSUBUO GSETGEU/LEZ GSETGEF.X GADDHUC GSUBHUC GCOM

minor operation code field values for XSHIFTI

    XSHIFTI 0 8 16 24 32 40 48 56
    0 XSHLI XSHLIO XSHRI XEXPANDI XCOMPRESSI
    1
    2
    3
    4 XSHLMI XSHLIOU XSHRMI XSHRIU XROTLI XEXPANDIU XROTRI XCOMPRESSIU
    5
    6
    7

minor operation code field values for XSHIFT

    XSHIFT 0 8 16 24 32 40 48 56
    0 XSHL XSHLO XSHR XEXPAND XCOMPRESS
    1
    2
    3
    4 XSHLM XSHLOU XSHRM XSHRU XROTL XEXPANDU XROTR XCOMPRESSU
    5
    6
    7

minor operation code field values for E.size

    E.size 0 8 16 24 32 40 48 56
    0 EMULFN EMULADDFN EADDFN ESUBFN EMUL EMULADD EDIVFN ECON
    1 EMULFZ EMULADDFZ EADDFZ ESUBFZ EMULU EMULADDU EDIVFZ ECONU
    2 EMULFF EMULADDFF EADDFF ESUBFF EMULM EMULADDM EDIVFF ECONM
    3 EMULFC EMULADDFC EADDFC ESUBFC EMULC EMULADDC EDIVFC ECONC
    4 EMULFX EMULADDFX EADDFX ESUBFX EMULSUM EMULSUB EDIVFX EDIV
    5 EMULF EMULADDF EADDF ESUBF EMULSUMU EMULSUBU EDIVF EDIVU
    6 EMULCF EMULADDCF ECONFL ECONCFL EMULSUMM EMULSUBM EMULSUMF EMULP
    7 EMULSUMCF EMULSUBCF ECONFB ECONCFB EMULSUMC EMULSUBC EMULSUBF EUNARY

minor operation code field values for W.MINOR.L or W.MINOR.B

    W.MINOR.order 0 8 16 24 32 40 48 56
    0 WMULMAT8 WMULMATM8
    1 WMULMAT16 WMULMATM16 WMULMATF16
    2 WMULMAT32 WMULMATM32 WMULMATF32
    3 WMULMAT64 WMULMATM64 WMULMATF64
    4 WMULMATU8 WMULMATC8 WMULMATP8
    5 WMULMATU16 WMULMATC16 WMULMATCF16 WMULMATP16
    6 WMULMATU32 WMULMATC32 WMULMATCF32 WMULMATP32
    7 WMULMATU64 WMULMATC64 WMULMATCF64 WMULMATP64

For the major operation field values E.MUL.X.I, E.MUL.X.I.U, E.MUL.X.I.M, E.MUL.X.I.C, E.MUL.ADD.X.I, E.MUL.ADD.X.I.U, E.MUL.ADD.X.I.M, E.MUL.ADD.X.I.C, E.CON.X.I.L, E.CON.X.I.B, E.CON.X.I.U.L, E.CON.X.I.U.B, E.CON.X.I.M.L, E.CON.X.I.M.B, E.CON.X.I.C.L, E.CON.X.I.C.B, E.EXTRACT.I, E.EXTRACT.I.U, W.MUL.MAT.X.I.U.L, W.MUL.MAT.X.I.U.B, W.MUL.MAT.X.I.M.L, W.MUL.MAT.X.I.M.B, W.MUL.MAT.X.I.C.L, and W.MUL.MAT.X.I.C.B, another six bits in the instruction specify a minor operation code, which indicates operand size, rounding, and shift amount:

The minor field is filled with a value from the following table. Note that the shift amount field value shown below is the “sh” value, which is encoded in an instruction-dependent manner from the immediate field in the assembler format.

    XI   0       8       16       24       32       40       48       56
    0    8.F.0   8.N.0   16.F.0   16.N.0   32.F.0   32.N.0   64.F.0   64.N.0
    1    8.F.1   8.N.1   16.F.1   16.N.1   32.F.1   32.N.1   64.F.1   64.N.1
    2    8.F.2   8.N.2   16.F.2   16.N.2   32.F.2   32.N.2   64.F.2   64.N.2
    3    8.F.3   8.N.3   16.F.3   16.N.3   32.F.3   32.N.3   64.F.3   64.N.3
    4    8.Z.0   8.C.0   16.Z.0   16.C.0   32.Z.0   32.C.0   64.Z.0   64.C.0
    5    8.Z.1   8.C.1   16.Z.1   16.C.1   32.Z.1   32.C.1   64.Z.1   64.C.1
    6    8.Z.2   8.C.2   16.Z.2   16.C.2   32.Z.2   32.C.2   64.Z.2   64.C.2
    7    8.Z.3   8.C.3   16.Z.3   16.C.3   32.Z.3   32.C.3   64.Z.3   64.C.3
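Since operation code values are the sum of the column and row legend values, the XI minor field can be computed directly from the operand size, rounding letter, and sh value. The Python sketch below is one way to express the table arithmetically; the function name `xi_minor` is hypothetical, and it assumes the last column legend, which is partly illegible in the source, is 56 as in the other minor operation code tables.

```python
SIZE_COLUMN = {8: 0, 16: 16, 32: 32, 64: 48}  # leftmost column legend for each size


def xi_minor(size, rounding, sh):
    """Return the 6-bit XI minor field for size.rounding.sh.

    size: 8, 16, 32 or 64; rounding: "F", "N", "Z" or "C"; sh: 0..3
    """
    if rounding not in ("F", "N", "Z", "C") or not 0 <= sh <= 3:
        raise ValueError("unsupported rounding or shift amount")
    column = SIZE_COLUMN[size] + (8 if rounding in ("N", "C") else 0)
    row = (4 if rounding in ("Z", "C") else 0) + sh
    return column + row  # legend above plus legend to the left
```

For example, the entry 16.N.2 sits in the column headed 24 and the row labelled 2, so its field value is 26.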

For the major operation field value GCOPYI, two bits in the instruction specify an operand size:

For the major operation field values G.AND.I, G.NAND.I, G.NOR.I, G.OR.I, G.XOR.I, G.ADD.I, G.ADD.I.O, G.ADD.I.UO, G.SET.AND.E.I, G.SET.AND.NE.I, G.SET.E.I, G.SET.GE.I, G.SET.L.I, G.SET.NE.I, G.SET.GE.I.U, G.SET.L.I.U, G.SUB.I, G.SUB.I.O, G.SUB.I.UO, two bits in the instruction specify an operand size:

The sz field is filled with a value from the following table:

    sz   size
    0    16
    1    32
    2    64
    3    128

For the major operation field values E.8, E.16, E.32, E.64, E.128, with minor operation field value E.UNARY, another six bits in the instruction specify a unary operation code:

The unary field is filled with a value from the following table:

unary operation code field values for E.UNARY.size

E.UNARY 0 8 16 24 32 40 48 56
0 ESQRFN ESUMFN ESINKFN EFLOATFN EDEFLATEFN ESUM
1 ESQRFZ ESUMFZ ESINKFZ EFLOATFZ EDEFLATEFZ ESUMU ESINKFZD
2 ESQRFF ESUMFF ESINKFF EFLOATFF EDEFLATEFF ELOGMOST ESINKFFD
3 ESQRFC ESUMFC ESINKFC EFLOATFC EDEFLATEFC ELOGMOSTU ESINKFCD
4 ESQRFX ESUMFX ESINKFX EFLOATFX EDEFLATEFX
5 ESQRF ESUMF ESINKF EFLOATF EDEFLATEF
6 ERSQRESTFX ERECESTFX EABSFX ENEGFX EINFLATEFX ECOPYFX
7 ERSQRESTF ERECESTF EABSF ENEGF EINFLATEF ECOPYF

For the major operation field values A.MINOR and G.MINOR, with minor operation field values A.COM and G.COM, another six bits in the instruction specify a comparison operation code:

The compare field is filled with a value from the following table:

compare operation code field values for A.COM.op and G.COM.op.size

x.COM 0 8 16 24 32 40 48 56
0 xCOME xCOMEF
1 xCOMNE xCOMLGF
2 xCOMANDE xCOMLF
3 xCOMANDNE xCOMGEF
4 xCOML xCOMEF.X
5 xCOMGE xCOMLGF.X
6 xCOMLU xCOMLF.X
7 xCOMGEU xCOMGEF.X

General Forms

The general forms of the instructions coded by a major operation code are one of the following:

The general forms of the instructions coded by major and minor operation codes are one of the following:

The general form of the instructions coded by major, minor, and unary operation codes is the following:

Register rd is either a source register or destination register, or both. Registers rc and rb are always source registers. Register ra is always a destination register.

Instruction Fetch Definition

def Thread(th) as
  forever do
    catch exception
      if (EventRegister & EventMask[th]) ≠ 0 then
        if ExceptionState=0 then
          raise EventInterrupt
        endif
      endif
      inst ← LoadMemoryX(ProgramCounter,ProgramCounter,32,L)
      Instruction(inst)
    endcatch
    case exception of
      EventInterrupt,
      ReservedInstruction,
      AccessDisallowedByVirtualAddress,
      AccessDisallowedByTag,
      AccessDisallowedByGlobalTB,
      AccessDisallowedByLocalTB,
      AccessDetailRequiredByTag,
      AccessDetailRequiredByGlobalTB,
      AccessDetailRequiredByLocalTB,
      MissInGlobalTB,
      MissInLocalTB,
      FixedPointArithmetic,
      FloatingPointArithmetic,
      GatewayDisallowed:
        case ExceptionState of
          0:
            PerformException(exception)
          1:
            PerformException(SecondException)
          2:
            PerformMachineCheck(ThirdException)
        endcase
      TakenBranch:
        ContinuationState ← (ExceptionState=0) ? 0 : ContinuationState
      TakenBranchContinue:
        /* nothing */
      none, others:
        ProgramCounter ← ProgramCounter + 4
        ContinuationState ← (ExceptionState=0) ? 0 : ContinuationState
    endcase
  endforever
enddef

Perform Exception Definition

def PerformException(exception) as
  v ← (exception > 7) ? 7 : exception
  t ← LoadMemory(ExceptionBase,ExceptionBase+Thread*128+64+8*v,64,L)
  if ExceptionState = 0 then
    u ← RegRead(3,128) ∥ RegRead(2,128) ∥ RegRead(1,128) ∥ RegRead(0,128)
    StoreMemory(ExceptionBase,ExceptionBase+Thread*128,512,L,u)
    RegWrite(0,64,ProgramCounter_(63..2) ∥ PrivilegeLevel)
    RegWrite(1,64,ExceptionBase+Thread*128)
    RegWrite(2,64,exception)
    RegWrite(3,64,FailingAddress)
  endif
  PrivilegeLevel ← t_(1..0)
  ProgramCounter ← t_(63..2) ∥ 0²
  case exception of
    AccessDetailRequiredByTag,
    AccessDetailRequiredByGlobalTB,
    AccessDetailRequiredByLocalTB:
      ContinuationState ← ContinuationState + 1
    others:
      /* nothing */
  endcase
  ExceptionState ← ExceptionState + 1
enddef

indicates data missing or illegible when filed

Instruction Decode

def Instruction(inst) as
  major ← inst_(31..24)
  rd ← inst_(23..18)
  rc ← inst_(17..12)
  simm ← rb ← inst_(11..6)
  minor ← ra ← inst_(5..0)
  case major of
    A.RES:
      AlwaysReserved
    A.MINOR:
      minor ← inst_(5..0)
      case minor of
        A.ADD, A.ADD.O, A.ADD.OU, A.AND, A.ANDN, A.NAND, A.NOR,
        A.OR, A.ORN, A.XNOR, A.XOR:
          Address(minor,rd,rc,rb)
        A.COM:
          compare ← inst_(11..6)
          case compare of
            A.COM.E, A.COM.NE, A.COM.AND.E, A.COM.AND.NE,
            A.COM.L, A.COM.GE, A.COM.L.U, A.COM.GE.U:
              AddressCompare(compare,rd,rc)
            others:
              raise ReservedInstruction
          endcase
        A.SUB, A.SUB.O, A.SUB.U.O,
        A.SET.AND.E, A.SET.AND.NE, A.SET.E, A.SET.NE,
        A.SET.L, A.SET.GE, A.SET.L.U, A.SET.GE.U:
          AddressReversed(minor,rd,rc,rb)
        A.SHL.I.ADD..A.SHL.I.ADD+3:
          AddressShiftLeftImmediateAdd(inst_(1..0),rd,rc,rb)
        A.SHL.I.SUB..A.SHL.I.SUB+3:
          AddressShiftLeftImmediateSubtract(inst_(1..0),rd,rc,rb)
        A.SHL.I, A.SHL.I.O, A.SHL.I.U.O, A.SHR.I, A.SHR.I.U, A.ROTR.I:
          AddressShiftImmediate(minor,rd,rc,simm)
        others:
          raise ReservedInstruction
      endcase
    A.COPY.I:
      AddressCopyImmediate(major,rd,inst_(17..0))
    A.ADD.I, A.ADD.I.O, A.ADD.I.U.O, A.AND.I, A.OR.I, A.NAND.I, A.NOR.I, A.XOR.I:
      AddressImmediate(major,rd,rc,inst_(11..0))
    A.SET.AND.E.I, A.SET.AND.NE.I, A.SET.E.I, A.SET.NE.I,
    A.SET.L.I, A.SET.GE.I, A.SET.LU.I, A.SET.GE.U.I,
    A.SUB.I, A.SUB.I.O, A.SUB.I.U.O:
      AddressImmediateReversed(major,rd,rc,inst_(11..0))
    A.MUX:
      AddressTernary(major,rd,rc,rb,ra)
    B.MINOR:
      case minor of
        B:
          Branch(rd,rc,rb)
        B.BACK:
          BranchBack(rd,rc,rb)
        B.BARRIER:
          BranchBarrier(rd,rc,rb)
        B.DOWN:
          BranchDown(rd,rc,rb)
        B.GATE:
          BranchGateway(rd,rc,rb)
        B.HALT:
          BranchHalt(rd,rc,rb)
        B.HINT:
          BranchHint(rd,inst_(17..12),simm)
        B.LINK:
          BranchLink(rd,rc,rb)
        others:
          raise ReservedInstruction
      endcase
    BE, BNE, BL, BGE, BLU, BGE.U, BAND.E, BAND.NE:
      BranchConditional(major,rd,rc,inst_(11..0))
    BHINTI:
      BranchHintImmediate(inst_(23..18),inst_(17..12),inst_(11..0))
    BI:
      BranchImmediate(inst_(23..0))
    BLINKI:
      BranchImmediateLink(inst_(23..0))
    BEF16, BLGF16, BLF16, BGEF16,
    BEF32, BLGF32, BLF32, BGEF32,
    BEF64, BLGF64, BLF64, BGEF64,
    BEF128, BLGF128, BLF128, BGEF128:
      BranchConditionalFloatingPoint(major,rd,rc,inst_(11..0))
    BIF32, BNIF32, BNVF32, BVF32:
      BranchConditionalVisibilityFloatingPoint(major,rd,rc,inst_(11..0))
    L.MINOR:
      case minor of
        L16L, LU16L, L32L, LU32L, L64L, LU64L, L128L, L8, LU8,
        L16AL, LU16AL, L32AL, LU32AL, L64AL, LU64AL, L128AL,
        L16B, LU16B, L32B, LU32B, L64B, LU64B, L128B,
        L16AB, LU16AB, L32AB, LU32AB, L64AB, LU64AB, L128AB:
          Load(minor,rd,rc,rb)
        others:
          raise ReservedInstruction
      endcase
    LI16L, LIU16L, LI32L, LIU32L, LI64L, LIU64L, LI128L, LI8, LIU8,
    LI16AL, LIU16AL, LI32AL, LIU32AL, LI64AL, LIU64AL, LI128AL,
    LI16B, LIU16B, LI32B, LIU32B, LI64B, LIU64B, LI128B,
    LI16AB, LIU16AB, LI32AB, LIU32AB, LI64AB, LIU64AB, LI128AB:
      LoadImmediate(major,rd,rc,inst_(11..0))
    S.MINOR:
      case minor of
        S16L, S32L, S64L, S128L, S8,
        S16AL, S32AL, S64AL, S128AL,
        SAS64AL, SCS64AL, SMS64AL, SM64AL,
        S16B, S32B, S64B, S128B,
        S16AB, S32AB, S64AB, S128AB,
        SAS64AB, SCS64AB, SMS64AB, SM64AB:
          Store(minor,rd,rc,rb)
        SDCS64AB, SDCS64AL:
          StoreDoubleCompareSwap(minor,rd,rc,rb)
        others:
          raise ReservedInstruction
      endcase
    SI16L, SI32L, SI64L, SI128L, SI8,
    SI16AL, SI32AL, SI64AL, SI128AL,
    SASI64AL, SCSI64AL, SMSI64AL, SMUXI64AL,
    SI16B, SI32B, SI64B, SI128B,
    SI16AB, SI32AB, SI64AB, SI128AB,
    SASI64AB, SCSI64AB, SMSI64AB, SMUXI64AB:
      StoreImmediate(major,rd,rc,inst_(11..0))
    G.8, G.16, G.32, G.64, G.128:
      minor ← inst_(5..0)
      size ← 0 ∥ 1 ∥ 0^(3+major-G.8)
      case minor of
        G.ADD, G.ADD.L, G.ADD.LU, G.ADD.O, G.ADD.OU:
          Group(minor,size,rd,rc,rb)
        G.ADDHC, G.ADDHF, G.ADDHN, G.ADDHZ,
        G.ADDHUC, G.ADDHUF, G.ADDHUN, G.ADDHUZ:
          GroupAddHalve(minor,inst_(1..0),size,rd,rc,rb)
        G.AAA, G.ASA:
          GroupInplace(minor,size,rd,rc,rb)
        G.SET.AND.E, G.SET.AND.NE, G.SET.E, G.SET.NE,
        G.SET.L, G.SET.GE, G.SET.L.U, G.SET.GE.U,
        G.SUB, G.SUB.L, G.SUB.LU, G.SUB.O, G.SUB.U.O:
          GroupReversed(minor,size,ra,rb,rc)
        G.SET.E.F, G.SET.LG.F, G.SET.GE.F, G.SET.L.F,
        G.SET.E.F.X, G.SET.LG.F.X, G.SET.GE.F.X, G.SET.L.F.X:
          GroupReversedFloatingPoint(minor.op, major.size,
            minor.round, rd, rc, rb)
        G.SHL.I.ADD..G.SHL.I.ADD+3:
          GroupShiftLeftImmediateAdd(inst_(1..0),size,rd,rc,rb)
        G.SHL.I.SUB..G.SHL.I.SUB+3:
          GroupShiftLeftImmediateSubtract(inst_(1..0),size,rd,rc,rb)
        G.SUBHC, G.SUBHF, G.SUBHN, G.SUBHZ,
        G.SUBHUC, G.SUBHUF, G.SUBHUN, G.SUBHUZ:
          GroupSubtractHalve(minor,inst_(1..0),size,rd,rc,rb)
        G.COM:
          compare ← inst_(11..6)
          case compare of
            G.COM.E, G.COM.NE, G.COM.AND.E, G.COM.AND.NE,
            G.COM.L, G.COM.GE, G.COM.L.U, G.COM.GE.U:
              GroupCompare(compare,size,ra,rb)
            others:
              raise ReservedInstruction
          endcase
        others:
          raise ReservedInstruction
      endcase
    G.BOOLEAN..G.BOOLEAN+1:
      GroupBoolean(major,rd,rc,rb,minor)
    G.COPY.I..G.COPY.I+1:
      size ← 0 ∥ 1 ∥ 0^(4+inst_(17..16))
      GroupCopyImmediate(major,size,rd,inst_(15..0))
    G.AND.I, G.NAND.I, G.NOR.I, G.OR.I, G.XOR.I,
    G.ADD.I, G.ADD.I.O, G.ADD.I.U.O:
      size ← 0 ∥ 1 ∥ 0^(4+inst_(11..10))
      GroupImmediate(major,size,rd,rc,inst_(9..0))
    G.SET.AND.E.I, G.SET.AND.NE.I, G.SET.E.I, G.SET.GE.I, G.SET.L.I,
    G.SET.NE.I, G.SET.GE.I.U, G.SET.L.I.U, G.SUB.I, G.SUB.I.O, G.SUB.I.U.O:
      size ← 0 ∥ 1 ∥ 0^(4+inst_(11..10))
      GroupImmediateReversed(major,size,rd,rc,inst_(9..0))
    G.MUX:
      GroupTernary(major,rd,rc,rb,ra)
    X.SHIFT:
      minor ← inst_(5..2) ∥ 0²
      size ← 0 ∥ 1 ∥ 0^(inst_(24) ∥ inst_(1..0))
      case minor of
        X.EXPAND, X.UEXPAND, X.SHL, X.SHL.O, X.SHL.U.O,
        X.ROTR, X.SHR, X.SHR.U:
          Crossbar(minor,size,rd,rc,rb)
        X.SHL.M, X.SHR.M:
          CrossbarInplace(minor,size,rd,rc,rb)
        others:
          raise ReservedInstruction
      endcase
    X.EXTRACT:
      CrossbarExtract(major,rd,rc,rb,ra)
    X.DEPOSIT, X.DEPOSIT.U, X.WITHDRAW, X.WITHDRAW.U:
      CrossbarField(major,rd,rc,inst_(11..6),inst_(5..0))
    X.DEPOSIT.M:
      CrossbarFieldInplace(major,rd,rc,inst_(11..6),inst_(5..0))
    X.SHIFT.I:
      minor ← inst_(5..0)
      case minor_(5..2) ∥ 0² of
        X.COMPRESS.I, X.EXPAND.I, X.ROTR.I, X.SHL.I, X.SHL.I.O, X.SHL.I.U,
        X.SHR.I, X.COMPRESS.I.U, X.EXPAND.I.U, X.SHR.UI:
          CrossbarShortImmediate(minor,rd,rc,simm)
        X.SHL.M.I, X.SHR.M.I:
          CrossbarShortImmediateInplace(minor,rd,rc,simm)
        others:
          raise ReservedInstruction
      endcase
    X.SHUFFLE..X.SHUFFLE+1:
      CrossbarShuffle(major,rd,rc,rb,simm)
    X.SWIZZLE..X.SWIZZLE+3:
      CrossbarSwizzle(major,rd,rc,inst_(11..6),inst_(5..0))
    X.SELECT.8:
      CrossbarTernary(major,rd,rc,rb,ra)
    E.8, E.16, E.32, E.64, E.128:
      minor ← inst_(5..0)
      size ← 0 ∥ 1 ∥ 0^(3+major-E.8)
      case minor of
        E.CON, E.CON.U, E.CON.M, E.CON.C,
        E.MUL, E.MUL.U, E.MUL.M, E.MUL.C,
        E.MUL.SUM, E.MUL.SUM.U, E.MUL.SUM.M, E.MUL.SUM.C,
        E.DIV, E.DIV.U, E.MUL.P:
          Ensemble(minor,size,ra,rb,rc)
        E.CON.F.L, E.CON.F.B, E.CON.C.F.L, E.CON.C.F.B:
          EnsembleConvolveFloatingPoint(minor,size,rd,rc,rb)
        E.ADD.F.N, E.MUL.C.F.N, E.MUL.F.N, E.DIV.F.N,
        E.ADD.F.Z, E.MUL.C.F.Z, E.MUL.F.Z, E.DIV.F.Z,
        E.ADD.F.F, E.MUL.C.F.F, E.MUL.F.F, E.DIV.F.F,
        E.ADD.F.C, E.MUL.C.F.C, E.MUL.F.C, E.DIV.F.C,
        E.ADD.F, E.MUL.C.F, E.MUL.F, E.DIV.F,
        E.ADD.F.X, E.MUL.C.F.X, E.MUL.F.X, E.DIV.F.X:
          EnsembleFloatingPoint(minor.op, major.size, minor.round, rd, rc, rb)
        E.MUL.ADD, E.MUL.ADD.U, E.MUL.ADD.M, E.MUL.ADD.C:
          EnsembleInplace(minor,size,rd,rc,rb)
        E.MUL.SUB, E.MUL.SUB.U, E.MUL.SUB.M, E.MUL.SUB.C:
          EnsembleInplaceReversed(minor,size,rd,rc,rb)
        E.MUL.SUB.F, E.MUL.SUB.C.F:
          EnsembleInplaceReversedFloatingPoint(minor,size,rd,rc,rb)
        E.SUB.F.N, E.SUB.F.Z, E.SUB.F.F, E.SUB.F.C, E.SUB.F, E.SUB.F.X:
          EnsembleReversedFloatingPoint(minor.op, major.size,
            minor.round, rd, rc, rb)
        E.UNARY:
          case unary of
            E.SUM, E.SUMU, E.LOG.MOST, E.LOG.MOST.U:
              EnsembleUnary(unary,rd,rc)
            E.ABS.F, E.ABS.F.X, E.COPY.F, E.COPY.F.X,
            E.DEFLATE.F, E.DEFLATE.F.N, E.DEFLATE.F.Z,
            E.DEFLATE.F.F, E.DEFLATE.F.C, E.DEFLATE.F.X,
            E.FLOAT.F, E.FLOAT.F.N, E.FLOAT.F.Z,
            E.FLOAT.F.F, E.FLOAT.F.C, E.FLOAT.F.X,
            E.INFLATE.F, E.INFLATE.F.X, E.NEG.F, E.NEG.F.X,
            E.RECEST.F, E.RECEST.F.X, E.RSQREST.F, E.RSQREST.F.X,
            E.SQR.F, E.SQR.F.N, E.SQR.F.Z, E.SQR.F.F, E.SQR.F.C, E.SQR.F.X,
            E.SUM.F, E.SUM.F.N, E.SUM.F.Z,
            E.SUM.F.F, E.SUM.F.C, E.SUM.F.X,
            E.SINK.F, E.SINK.F.Z.D, E.SINK.F.F.D, E.SINK.F.C.D, E.SINK.F.X.D,
            E.SINK.F.N, E.SINK.F.Z, E.SINK.F.F, E.SINK.F.C, E.SINK.F.X:
              EnsembleUnaryFloatingPoint(unary.op, major.size,
                unary.round, rd, rc)
            others:
              raise ReservedInstruction
          endcase
        others:
          raise ReservedInstruction
      endcase
    E.CON.X.IL, E.CON.X.IB, E.CON.X.IUL, E.CON.X.IUB,
    E.CON.X.IML, E.CON.X.IMB, E.CON.X.ICL, E.CON.X.ICB:
      size ← 1 ∥ 0^(3+inst_(5..4))
      EnsembleConvolveExtractImmediate(major,inst_(3..2),size,rd,rc,rb,inst_(1..0))
    E.MULX, E.EXTRACT, E.SCAL.ADDX:
      EnsembleExtract(major,rd,rc,rb,ra)
    E.EXTRACTI, E.EXTRACTIU, E.MULXI, E.MULXIU, E.MULXIM, E.MULXIC:
      size ← 1 ∥ 0^(3+inst_(5..4))
      EnsembleExtractImmediate(major,inst_(3..2),size,rd,rc,rb,inst_(1..0))
    E.MUL.ADD.X.I, E.MUL.ADD.X.I.U, E.MUL.ADD.X.I.M, E.MUL.ADD.X.I.C:
      size ← 1 ∥ 0^(3+inst_(5..4))
      EnsembleExtractImmediateInplace(major,inst_(3..2),size,rd,rc,rb,inst_(1..0))
    E.MUL.GAL.8, E.MUL.GAL.64:
      size ← 1 ∥ 0^(3+inst_(26..24))
      EnsembleTernary(major,size,rd,rc,rb,ra)
    E.MUL.ADD.F16, E.MUL.ADD.F32, E.MUL.ADD.F64, E.MUL.ADD.F128,
    E.MULSUB.F16, E.MULSUB.F32, E.MULSUB.F64, E.MULSUB.F128,
    E.SCAL.ADD.F16, E.SCAL.ADD.F32, E.SCAL.ADD.F64:
      EnsembleTernaryFloatingPoint(major,rd,rc,rb,ra)
    W.MINOR.B, W.MINOR.L:
      case minor of
        W.TRANSLATE.8, W.TRANSLATE.16, W.TRANSLATE.32, W.TRANSLATE.64:
          size ← 1 ∥ 0^(3+inst_(5..4))
          WideTranslate(major,size,rd,rc,rb)
        W.MUL.MAT.8, W.MUL.MAT.16, W.MUL.MAT.32, W.MUL.MAT.64,
        W.MUL.MAT.U.8, W.MUL.MAT.U.16, W.MUL.MAT.U.32, W.MUL.MAT.U.64,
        W.MUL.MAT.M.8, W.MUL.MAT.M.16, W.MUL.MAT.M.32, W.MUL.MAT.M.64,
        W.MUL.MAT.C.8, W.MUL.MAT.C.16, W.MUL.MAT.C.32, W.MUL.MAT.C.64,
        W.MUL.MAT.P.8, W.MUL.MAT.P.16, W.MUL.MAT.P.32, W.MUL.MAT.P.64:
          size ← 1 ∥ 0^(3+inst_(5..4))
          WideMultiply(major,minor,size,rd,rc,rb)
        W.MUL.MAT.F16, W.MUL.MAT.F32, W.MUL.MAT.F64,
        W.MUL.MAT.C.F16, W.MUL.MAT.C.F32, W.MUL.MAT.C.F64:
          size ← 1 ∥ 0^(3+inst_(5..4))
          WideFloatingPointMultiply(major,minor,size,rd,rc,rb)
        others:
      endcase
    W.MUL.MAT.X.B, W.MUL.MAT.X.L:
      WideExtract(major,ra,rb,rc,rd)
    W.MUL.MAT.X.I.B, W.MUL.MAT.X.I.L, W.MUL.MAT.X.I.U.B, W.MUL.MAT.X.I.U.L,
    W.MUL.MAT.X.I.M.B, W.MUL.MAT.X.I.M.L, W.MUL.MAT.X.I.C.B, W.MUL.MAT.X.I.C.L:
      size ← 1 ∥ 0^(3+inst_(5..4))
      WideExtractImmediate(major,inst_(3..2),size,ra,rb,rc,inst_(1..0))
    W.MUL.MAT.G.B, W.MUL.MAT.G.L:
      WideMultiplyGalois(major,rd,rc,rb,ra)
    W.SWITCH.B, W.SWITCH.L:
      WideSwitch(major,rd,rc,rb,ra)
    others:
      raise ReservedInstruction
  endcase
enddef
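The first five assignments of the Instruction definition split the 32-bit instruction word into fixed fields. A minimal Python sketch of that field extraction, and of the element-size computation used by the G.8 to G.128 and E.8 to E.128 cases, follows. The function names are hypothetical, and because numeric opcode values are not given in this section, `element_size` takes the base opcode as a parameter supplied by the caller.

```python
def decode_fields(inst):
    """Extract the fixed fields of a 32-bit instruction word."""
    inst &= 0xFFFFFFFF
    return {
        "major": (inst >> 24) & 0xFF,  # inst_(31..24)
        "rd": (inst >> 18) & 0x3F,     # inst_(23..18)
        "rc": (inst >> 12) & 0x3F,     # inst_(17..12)
        "rb": (inst >> 6) & 0x3F,      # inst_(11..6), also read as simm
        "ra": inst & 0x3F,             # inst_(5..0), also read as minor
    }


def element_size(major, base):
    """size <- 0 || 1 || 0^(3+major-base): a one followed by 3+major-base zeros.

    `base` stands for the opcode value of G.8 or E.8 (an assumption of this
    sketch; the actual values are not stated here).
    """
    return 1 << (3 + major - base)
```

With consecutive opcodes for the five sizes, `element_size(base, base)` yields 8 and `element_size(base + 4, base)` yields 128, matching the G.8 through G.128 naming.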

Group Boolean

In accordance with one embodiment of the invention, these operations take operands from three registers, perform Boolean operations on corresponding bits in the operands, and place the concatenated results in the third register.

In accordance with one embodiment of the invention, the processor handles a variety of Group Boolean operations. For example, FIG. 31A presents various Group Boolean instructions. FIGS. 31B and 31C illustrate an exemplary embodiment of a format and operation codes that can be used to perform the Boolean instructions shown in FIG. 31A. As shown in FIGS. 31B and 31C, in this exemplary embodiment, three values are taken from the contents of registers rd, rc and rb. The ih and il fields specify a function of three bits, producing a single bit result. The specified function is evaluated for each bit position, and the results are catenated and placed in register rd. Register rd is both a source and destination of this instruction.

The function is specified by eight bits, which give the result for each possible value of the three source bits in each bit position:

d           1  1  1  1  0  0  0  0
c           1  1  0  0  1  1  0  0
b           1  0  1  0  1  0  1  0
f(d, c, b)  f₇ f₆ f₅ f₄ f₃ f₂ f₁ f₀
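As an illustration of the truth table above, the following Python sketch evaluates an eight-bit function code bit by bit across register-wide operands. The function name and the `width` parameter are assumptions made for the example, and the sketch takes the eight function bits directly rather than the seven-bit encoded immediate.

```python
def group_boolean(f, d, c, b, width=128):
    """Evaluate the three-input function coded by f (bits f7..f0) at every bit.

    Bit i of f is the result when (d, c, b), read as a three-bit number,
    equals i, as in the truth table.
    """
    mask = (1 << width) - 1
    result = 0
    for i in range(8):
        if (f >> i) & 1:
            # Build the minterm for input combination i.
            td = d if i & 4 else ~d
            tc = c if i & 2 else ~c
            tb = b if i & 1 else ~b
            result |= td & tc & tb
    return result & mask
```

For example, the code 11001010 (d?c:b) gives the same result as `(d & c) | (~d & b)` at every bit position.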

A function can be modified by rearranging the bits of the immediate value. The table below shows how rearrangement of immediate value f₇..₀ can reorder the operands d, c, b for the same function.

operation   immediate
f(d, c, b)  f₇ f₆ f₅ f₄ f₃ f₂ f₁ f₀
f(c, d, b)  f₇ f₆ f₃ f₂ f₅ f₄ f₁ f₀
f(d, b, c)  f₇ f₅ f₆ f₄ f₃ f₁ f₂ f₀
f(b, c, d)  f₇ f₃ f₅ f₁ f₆ f₂ f₄ f₀
f(c, b, d)  f₇ f₅ f₃ f₁ f₆ f₄ f₂ f₀
f(b, d, c)  f₇ f₃ f₆ f₂ f₅ f₁ f₄ f₀
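The rows of the rearrangement table can be derived mechanically. The Python sketch below (function names are illustrative, not from the specification) computes, for a given operand order, which original bit f_i lands in each position of the rearranged immediate, and applies that permutation to a function code.

```python
def rearranged_indices(order):
    """For order such as "cdb", meaning g(d, c, b) = f(c, d, b),
    return the original bit indices that form g7..g0 (left to right)."""
    indices = []
    for i in range(7, -1, -1):
        vals = {"d": (i >> 2) & 1, "c": (i >> 1) & 1, "b": i & 1}
        x, y, z = (vals[name] for name in order)
        indices.append((x << 2) | (y << 1) | z)
    return indices


def rearrange(f, order):
    """Return the function code for f with its operands reordered."""
    g = 0
    for pos, src in zip(range(7, -1, -1), rearranged_indices(order)):
        g |= ((f >> src) & 1) << pos
    return g
```

For instance, `rearranged_indices("dbc")` returns `[7, 5, 6, 4, 3, 1, 2, 0]`, which is the f(d, b, c) row, and `rearrange(0b11001010, "dbc")` turns d?c:b into d?b:c, whose code is 10101100.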

By using such a rearrangement, an operation of the form: b=f(d,c,b) can be recoded into a legal form: b=f(b,d,c). For example, the function: b=f(d,c,b)=d?c:b cannot be coded, but the equivalent function: d=c?b:d can be determined by rearranging the code for d=f(d,c,b)=d?c:b, which is 11001010, according to the rule for f(d,c,b) ⇒ f(c,b,d), to the code 10111000.

Encoding: Some special characteristics of this rearrangement are the basis of the manner in which the eight function specification bits are compressed to seven immediate bits in this instruction. As seen in the table above, in the general case, a rearrangement of operands from f(d,c,b) to f(d,b,c) (interchanging rc and rb) requires interchanging the values of f₆ and f₅ and the values of f₂ and f₁.

Among the 256 possible functions which this instruction can perform, one quarter of them (64 functions) are unchanged by this rearrangement. These functions have the property that f₆=f₅ and f₂=f₁. The values of rc and rb (note that rc and rb are the register specifiers, not the register contents) can be freely interchanged, and so are sorted into rising or falling order to indicate the value of f₂. (A special case arises when rc=rb, so the sorting of rc and rb cannot convey information. However, as only the values f₇, f₄, f₃, and f₀ can ever result in this case, f₆, f₅, f₂, and f₁ need not be coded for this case, so no special handling is required.) These functions are encoded by the values of f₇, f₆, f₄, f₃, and f₀ in the immediate field and f₂ by whether rc>rb, thus using 32 immediate values for 64 functions.

Another quarter of the functions have f₆=1 and f₅=0. These functions are recoded by interchanging rc and rb, f₆ and f₅, f₂ and f₁. They then share the same encoding as the quarter of the functions where f₆=0 and f₅=1, and are encoded by the values of f₇, f₄, f₃, f₂, f₁, and f₀ in the immediate field, thus using 64 immediate values for 128 functions.

The remaining quarter of the functions have f₆=f₅ and f₂≠f₁. The half of these in which f₂=1 and f₁=0 are recoded by interchanging rc and rb, f₆ and f₅, f₂ and f₁. They then share the same encoding as the eighth of the functions where f₂=0 and f₁=1, and are encoded by the values of f₇, f₆, f₄, f₃, and f₀ in the immediate field, thus using 32 immediate values for 64 functions.

The function encoding is summarized by the table:

f₇ f₆ f₅ f₄ f₃ f₂ f₁ f₀  trc>trb  ih  il₅ il₄ il₃ il₂ il₁ il₀  rc   rb
-  -  f₆ -  -  -  f₂ -   f₂       0   0   f₆  f₇  f₄  f₃  f₀   trc  trb
-  -  f₆ -  -  -  f₂ -   ~f₂      0   0   f₆  f₇  f₄  f₃  f₀   trb  trc
-  -  f₆ -  -  0  1  -   -        0   1   f₆  f₇  f₄  f₃  f₀   trc  trb
-  -  f₆ -  -  1  0  -   -        0   1   f₆  f₇  f₄  f₃  f₀   trb  trc
-  0  1  -  -  -  -  -   -        1   f₂  f₁  f₇  f₄  f₃  f₀   trc  trb
-  1  0  -  -  -  -  -   -        1   f₁  f₂  f₇  f₄  f₃  f₀   trb  trc

The function decoding is summarized by the table:

ih  il₅ il₄ il₃ il₂ il₁ il₀  rc>rb  f₇  f₆  f₅  f₄  f₃  f₂  f₁  f₀
0   0   -   -   -   -   -    0      il₃ il₄ il₄ il₂ il₁ 0   0   il₀
0   0   -   -   -   -   -    1      il₃ il₄ il₄ il₂ il₁ 1   1   il₀
0   1   -   -   -   -   -    -      il₃ il₄ il₄ il₂ il₁ 0   1   il₀
1   -   -   -   -   -   -    -      il₃ 0   1   il₂ il₁ il₅ il₄ il₀
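The encoding and decoding rules above can be sketched in Python as follows. This is an illustrative reading of the two tables, not a definitive implementation: the function names are hypothetical, `trc` and `trb` are the register specifiers requested by the programmer, and `rc` and `rb` are the specifiers actually placed in the instruction. When the specifiers are swapped, the decoded code is the original function with operands c and b interchanged, so the operation performed on the register contents is unchanged.

```python
def bit(x, i):
    return (x >> i) & 1


def swap_cb(f):
    """Function code after interchanging operands c and b (f6<->f5, f2<->f1)."""
    keep = f & 0b10011001
    return (keep | (bit(f, 6) << 5) | (bit(f, 5) << 6)
            | (bit(f, 2) << 1) | (bit(f, 1) << 2))


def encode(f, trc, trb):
    """Compress function code f into (ih, il, rc, rb)."""
    f7, f6, f5, f4, f3, f2, f1, f0 = (bit(f, i) for i in range(7, -1, -1))
    low = (f7 << 3) | (f4 << 2) | (f3 << 1) | f0
    if f6 == f5:
        if f2 == f1:
            # f2 is conveyed by the ordering of the register specifiers.
            swap = (trc > trb) != bool(f2)
            ih, il = 0, (f6 << 4) | low
        else:
            swap = f2 == 1
            ih, il = 0, (1 << 5) | (f6 << 4) | low
    else:
        swap = f6 == 1
        hi, lo = (f1, f2) if swap else (f2, f1)
        ih, il = 1, (hi << 5) | (lo << 4) | low
    rc, rb = (trb, trc) if swap else (trc, trb)
    return ih, il, rc, rb


def decode(ih, il, rc, rb):
    """Recover the eight function bits from ih, il and the specifier order."""
    il5, il4, il3, il2, il1, il0 = (bit(il, i) for i in range(5, -1, -1))
    if ih == 0:
        f6 = f5 = il4
        if il5 == 0:
            f2 = f1 = int(rc > rb)
        else:
            f2, f1 = 0, 1
    else:
        f6, f5, f2, f1 = 0, 1, il5, il4
    return ((il3 << 7) | (f6 << 6) | (f5 << 5) | (il2 << 4)
            | (il1 << 3) | (f2 << 2) | (f1 << 1) | il0)
```

The sketch does not treat the rc=rb case separately; as the text notes, only f₇, f₄, f₃, and f₀ can affect the result in that case.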

Group Multiplex

These operations take three values from registers, perform a group of calculations on partitions of bits of the operands and place the catenated results in a fourth register.

In accordance with one embodiment of the invention, the processor handles group multiplex operations. FIGS. 31D and 31E illustrate an exemplary embodiment of a format and operation codes that can be used to perform the various Group Multiplex instructions. As shown in FIGS. 31D and 31E, in this exemplary embodiment, the contents of registers rd, rc and rb are fetched. Each bit of the result is equal to the corresponding bit of rc, if the corresponding bit of rd is set, otherwise it is the corresponding bit of rb. The result is placed into register ra. While the use of three operand registers and a different result register is described here and elsewhere in the present specification, other arrangements, such as the use of immediate values, may also be implemented.

The table marked Redundancies in FIG. 31D illustrates that for particular values of the register specifiers, the Group Multiplex operation performs operations otherwise available within the Group Boolean instructions. More specifically, when the result register ra is also present as a source register in the first, second or third source operand position of the operation, the operation is equivalent to the Group Boolean instruction with function codes of 11001010, 11100010, or 11011000 (binary), respectively. When the first source operand is the same as the second or third source operand, the Group Multiplex operation is equivalent to a bitwise OR or AND operation, respectively.
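The bitwise selection and the OR and AND redundancies described above can be checked with a short Python sketch. The function name and the fixed 128-bit width are assumptions of the example.

```python
MASK128 = (1 << 128) - 1


def group_multiplex(d, c, b):
    """Each result bit is the bit of c where d is set, otherwise the bit of b."""
    return ((d & c) | (~d & b)) & MASK128


# When the first source operand is also the second one, the result is d OR b.
d, b = 0x00FF00FF, 0x0F0F0F0F
assert group_multiplex(d, d, b) == d | b

# When the first source operand is also the third one, the result is d AND c.
c = 0x33333333
assert group_multiplex(d, c, d) == d & c
```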

Group Add

In accordance with one embodiment of the invention, these operations take operands from two registers, perform operations on partitions of bits in the operands, and place the concatenated results in a third register.

In accordance with one embodiment of the invention, the processor handles a variety of fixed-point, or integer, group operations. For example, FIG. 32A presents various examples of Group Add instructions accommodating different operand sizes, such as a byte (8 bits), doublet (16 bits), quadlet (32 bits), octlet (64 bits), and hexlet (128 bits). FIGS. 32B and 32C illustrate an exemplary embodiment of a format and operation codes that can be used to perform the various Group Add instructions shown in FIG. 32A. As shown in FIGS. 32B and 32C, in this exemplary embodiment, the contents of registers rc and rb are partitioned into groups of operands of the size specified and added, and if specified, checked for overflow or limited, yielding a group of results, each of which is the size specified. The group of results is catenated and placed in register rd. While the use of two operand registers and a different result register is described here and elsewhere in the present specification, other arrangements, such as the use of immediate values, may also be implemented.

In the present embodiment, for example, if the operand size specified is a byte (8 bits), and each register is 128 bits wide, then the content of each register may be partitioned into 16 individual operands, and 16 different individual add operations may take place as the result of a single Group Add instruction. Other instructions involving groups of operands may perform group operations in a similar fashion.
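The partitioning described above can be modeled with a brief Python sketch. It covers only the plain wrap-around form of the add (no overflow check and no limiting), represents each register as a Python integer, and uses a hypothetical function name.

```python
def group_add(rc, rb, size, width=128):
    """Add corresponding size-bit partitions of rc and rb, wrapping on overflow.

    Carries never cross a partition boundary, so a 128-bit register with
    size=8 behaves as 16 independent byte adders.
    """
    mask = (1 << size) - 1
    result = 0
    for shift in range(0, width, size):
        a = (rc >> shift) & mask
        b = (rb >> shift) & mask
        result |= ((a + b) & mask) << shift
    return result
```

The same operands give different results for different sizes: with bytes, `group_add(0x01FF, 0x0101, 8)` yields 0x0200 because the carry out of the low byte is discarded, whereas with doublets `group_add(0x01FF, 0x0101, 16)` yields 0x0300.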

Group Subtract

In accordance with one embodiment of the invention, these operations take two values from registers, perform operations on partitions of bits in the operands, and place the concatenated results in a register. Two values are taken from the contents of registers rc and rb. The specified operation is performed, and the result is placed in register rd.

Similarly, FIG. 33A presents various examples of Group Subtract instructions accommodating different operand sizes. FIGS. 33B and 33C illustrate an exemplary embodiment of a format and operation codes that can be used to perform the various Group Subtract instructions. As shown in FIGS. 33B and 33C, in this exemplary embodiment, the contents of registers rc and rb are partitioned into groups of operands of the size specified and subtracted, and if specified, checked for overflow or limited, yielding a group of results, each of which is the size specified. The group of results is catenated and placed in register rd.

Group Set

In accordance with one embodiment of the invention, these operations take two values from registers, perform operations on partitions of bits in the operands, and place the concatenated results in a register. Two values are taken from the contents of registers rc and rb. The specified operation is performed, and the result is placed in register rd.

FIG. 33A also presents various examples of Group Set instructions accommodating different operand sizes. FIG. 33A also presents additional pseudo-instructions which are equivalent to other Group Set instructions according to the mapping rules further presented in FIG. 33A. FIGS. 33B and 33C illustrate an exemplary embodiment of a format and operation codes that can be used to perform the various Group Set instructions. As shown in FIGS. 33B and 33C, in this exemplary embodiment, the contents of registers rc and rb are partitioned into groups of operands of the size specified and the specified comparisons are performed, each producing a Boolean result repeated to the size specified, yielding a group of results, each of which is the size specified. The group of results is catenated and placed in register rd. In the present embodiment, certain comparisons between two identically specified registers, for which the result of such comparisons would be predictable no matter what the contents of the register, are used to encode comparisons against a zero value.

These operations take two values from registers, perform operations on partitions of bits in the operands, and place the concatenated results in a register. Two values are taken from the contents of registers rc and rb. The specified operation is performed, and the result is placed in register rd.

Combination of Group Set and Boolean operations

In an embodiment of the invention, conditional operations are provided in the sense that the set on condition operations can be used to construct bit masks that can select between alternate vector expressions, using the bitwise Boolean operations.
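A minimal Python sketch of this combination follows, assuming an unsigned less-than comparison in the style of G.SET.L.U and a bitwise select equivalent to the d?c:b Boolean function. The function names are illustrative; the example computes an element-wise unsigned maximum without any branch per element.

```python
def group_set_less_unsigned(rc, rb, size, width=128):
    """Return a mask with all ones in each partition where rc < rb (unsigned)."""
    mask = (1 << size) - 1
    result = 0
    for shift in range(0, width, size):
        a = (rc >> shift) & mask
        b = (rb >> shift) & mask
        if a < b:
            result |= mask << shift
    return result


def bitwise_select(m, x, y, width=128):
    """Take bits of x where m is set and bits of y elsewhere (m ? x : y)."""
    full = (1 << width) - 1
    return ((m & x) | (~m & y)) & full


def group_max_unsigned(rc, rb, size, width=128):
    """Element-wise maximum built from a set-on-condition mask and a select."""
    m = group_set_less_unsigned(rc, rb, size, width)
    return bitwise_select(m, rb, rc, width)
```

With two byte elements, `group_max_unsigned(0x0509, 0x0703, 8, 16)` selects 0x07 from the second operand and 0x09 from the first, giving 0x0709.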

Ensemble Divide/Multiply

Embodiments of the invention provide for other fixed-point group operations also. FIG. 34A presents various examples of Ensemble Divide and Ensemble Multiply instructions accommodating different operand sizes. FIGS. 34B and 34C illustrate an exemplary embodiment of a format and operation codes that can be used to perform the various Ensemble Divide and Ensemble Multiply instructions. As shown in FIGS. 34B and 34C, in this exemplary embodiment, the contents of registers rc and rb are partitioned into groups of operands of the size specified and divided or multiplied, yielding a group of results. The group of results is catenated and placed in register rd.

These operations take operands from two registers, perform operations on partitions of bits in the operands, and place the concatenated results in a third register. Two values are taken from the contents of registers rc and rb. The specified operation is performed, and the result is placed in register rd.

Group Compare

FIG. 35A presents various examples of Group Compare instructions accommodating different operand sizes. FIGS. 35B and 35C illustrate an exemplary embodiment of a format and operational codes that can be used to perform the various Group Compare instructions. As shown in FIGS. 35B and 35C, in this exemplary embodiment, these operations perform calculations on partitions of bits in two general register values, and generate a fixed-point arithmetic exception if the condition specified is met. Two values are taken from the contents of registers rd and rc. The specified condition is calculated on partitions of the operands. If the specified condition is true for any partition, a fixed-point arithmetic exception is generated. This instruction generates no general purpose register results.
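The trap-on-condition behavior can be sketched in Python as below. The exception class mirrors the FixedPointArithmetic exception name used in the pseudocode; the function name and the use of a caller-supplied predicate on unsigned partition values are assumptions of this example.

```python
import operator


class FixedPointArithmetic(Exception):
    """Stands in for the processor's fixed-point arithmetic exception."""


def group_compare(condition, rd, rc, size, width=128):
    """Raise FixedPointArithmetic if condition holds for any partition pair.

    Nothing is returned: the operation produces no register result.
    """
    mask = (1 << size) - 1
    for shift in range(0, width, size):
        a = (rd >> shift) & mask
        b = (rc >> shift) & mask
        if condition(a, b):
            raise FixedPointArithmetic(f"condition met in partition at bit {shift}")


# No byte of the first operand equals the matching byte of the second: no trap.
group_compare(operator.eq, 0x0102, 0x0201, 8, 16)
```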

Ensemble Unary

FIG. 36A presents various examples of Ensemble Unary instructions accommodating different operand sizes. FIGS. 36B and 36C illustrate an exemplary embodiment of a format and operational codes that can be used to perform the various Ensemble Unary instructions. As shown in FIGS. 36B and 36C, in this exemplary embodiment, these operations take operands from a register, perform operations on partitions of bits in the operand, and place the concatenated results in a second register. Values are taken from the contents of register rc. The specified operation is performed, and the result is placed in register rd. The code E.SUM.U.1 in FIG. 36A is preferably encoded as E.SUM.U.128.

Ensemble Floating-Point Add, Divide, Multiply, and Subtract

In accordance with one embodiment of the invention, the processor also handles a variety of floating-point group operations accommodating different operand sizes. Here, the different operand sizes may represent floating-point operands of different precisions, such as half-precision (16 bits), single-precision (32 bits), double-precision (64 bits), and quad-precision (128 bits). FIG. 37 illustrates exemplary functions that are defined for use within the detailed instruction definitions in other sections and figures. In the functions set forth in FIG. 37, an internal format represents infinite-precision floating-point values as a four-element structure consisting of (1) s (sign bit): 0 for positive, 1 for negative, (2) t (type): NORM, ZERO, SNAN, QNAN, INFINITY, (3) e (exponent), and (4) f (fraction). The mathematical interpretation of a normal value places the binary point at the units of the fraction, adjusted by the exponent: (−1)^s*(2^e)*f. The function F converts a packed IEEE floating-point value into internal format. The function PackF converts an internal format back into IEEE floating-point format, with rounding and exception control.
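To make the internal format concrete, the following Python sketch unpacks a half-precision (16-bit) IEEE value into an (s, t, e, f) tuple with an integer fraction, so that a normal value equals (−1)^s*(2^e)*f. This is not the function F of FIG. 37, which is not reproduced in this section; the function names, the treatment of subnormals as NORM with no hidden bit, and the use of the top fraction bit to distinguish QNAN from SNAN are assumptions of the example.

```python
from fractions import Fraction


def unpack_half(bits):
    """Convert a 16-bit IEEE half-precision pattern to (s, t, e, f)."""
    s = (bits >> 15) & 1
    exp = (bits >> 10) & 0x1F
    frac = bits & 0x3FF
    if exp == 0x1F:
        if frac == 0:
            return (s, "INFINITY", 0, 0)
        return (s, "QNAN" if frac & 0x200 else "SNAN", 0, frac)
    if exp == 0:
        if frac == 0:
            return (s, "ZERO", 0, 0)
        return (s, "NORM", -24, frac)          # subnormal: no hidden bit
    # Normal: restore the hidden bit; bias 15 and 10 fraction bits give exp - 25.
    return (s, "NORM", exp - 25, frac | 0x400)


def exact_value(x):
    """Exact rational value (-1)^s * 2^e * f of a NORM or ZERO tuple."""
    s, t, e, f = x
    if t not in ("NORM", "ZERO"):
        raise ValueError("no finite value for type " + t)
    return (-1) ** s * Fraction(2) ** e * f
```

For example, the pattern 0x3C00 unpacks to (0, "NORM", −10, 1024), whose exact value is 1.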

FIGS. 38A and 39A present various examples of Ensemble Floating Point Add, Divide, Multiply, and Subtract instructions. FIGS. 38B-C and 39B-C illustrate an exemplary embodiment of formats and operation codes that can be used to perform the various Ensemble Floating Point Add, Divide, Multiply, and Subtract instructions. In these examples, Ensemble Floating Point Add, Divide, and Multiply instructions have been labeled as “EnsembleFloatingPoint.” Also, Ensemble Floating-Point Subtract instructions have been labeled as “EnsembleReversedFloatingPoint.” As shown in FIGS. 38B-C and 39B-C, in this exemplary embodiment, the contents of registers ra and rb (or rc and rb) are partitioned into groups of operands of the size specified, and the specified group operation is performed, yielding a group of results. The group of results is catenated and placed in register rc (or rd).

These operations take two values from registers, perform a group of floating-point arithmetic operations on partitions of bits in the operands, and place the concatenated results in a register. For Ensemble Floating-point operations, the contents of registers ra and rb are combined using the specified floating-point operation. The result is placed in register rc. For Ensemble Reversed Floating-point operations, the contents of registers rc and rb are combined using the specified floating-point operation. The result is placed in register rd.

In the present embodiment, the operation is rounded using the specified rounding option or using round-to-nearest if not specified. If a rounding option is specified, the operation raises a floating-point exception if a floating-point invalid operation, divide by zero, overflow, or underflow occurs, or when specified, if the result is inexact. If a rounding option is not specified, floating-point exceptions are not raised, and are handled according to the default rules of IEEE 754.

Ensemble Multiply-Add Floating-Point

FIG. 38D presents various examples of Ensemble Floating Point Multiply Add instructions. FIGS. 38E-F illustrate an exemplary embodiment of formats and operation codes that can be used to perform the various Ensemble Floating Point Multiply Add instructions. In these examples, Ensemble Floating Point Multiply Add instructions have been labeled as “EnsembleInplaceFloatingPoint.” As shown in FIGS. 38E-F, in this exemplary embodiment, operations take operands from three registers, perform operations on partitions of bits in the operands, and place the concatenated results in the third register. The contents of registers rd, rc and rb are fetched. The specified operation is performed on these operands. The result is placed into register rd. Specifically, the contents of registers rd, rc and rb are partitioned into groups of operands of the size specified, and for each partitioned element, the contents of registers rc and rb are multiplied and added to the contents of register rd, yielding a group of results. The group of results is catenated and placed in register rd. Register rd is both a source and destination of this instruction.

In the present embodiment, the operation is rounded using the specified rounding option or using round-to-nearest if not specified. If a rounding option is specified, the operation raises a floating-point exception if a floating-point invalid operation, divide by zero, overflow, or underflow occurs, or when specified, if the result is inexact. If a rounding option is not specified, floating-point exceptions are not raised, and are handled according to the default rules of IEEE 754.
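The per-element multiply-add behavior described above may be sketched in Python as follows. This is an illustration under stated assumptions only: registers are modeled as 128-bit integers, elements are taken to be single-precision values with element 0 in the low-order bits, and the helper names (unpack_f32, pack_f32, ensemble_mul_add_f32) are hypothetical rather than taken from the figures. Rounding options and exception behavior are not modeled, and the product and sum are rounded separately by the host rather than in a single operation.

```python
import struct

def unpack_f32(reg):
    """Split a 128-bit integer into four float32 elements (element 0 = low-order bits)."""
    return list(struct.unpack("<4f", reg.to_bytes(16, "little")))

def pack_f32(elems):
    """Catenate four float32 elements into one 128-bit integer."""
    return int.from_bytes(struct.pack("<4f", *elems), "little")

def ensemble_mul_add_f32(rd, rc, rb):
    """Model rd <- rd + rc * rb per 32-bit partition (rd is source and destination)."""
    d, c, b = unpack_f32(rd), unpack_f32(rc), unpack_f32(rb)
    return pack_f32([di + ci * bi for di, ci, bi in zip(d, c, b)])

rd = pack_f32([1.0, 2.0, 3.0, 4.0])
rc = pack_f32([0.5, 0.5, 0.5, 0.5])
rb = pack_f32([2.0, 4.0, 6.0, 8.0])
result = unpack_f32(ensemble_mul_add_f32(rd, rc, rb))
print(result)  # [2.0, 4.0, 6.0, 8.0]
```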

Group Scale-Add Floating-Point

In accordance with one embodiment of the invention, these operations take three values from registers, perform a group of floating-point arithmetic operations on partitions of bits in the operands, and place the concatenated results in a register.

FIG. 38G presents various examples of Ensemble Floating Point Scale Add instructions. FIGS. 38H-I illustrate an exemplary embodiment of formats and operation codes that can be used to perform the various Ensemble Floating Point Scale Add instructions. In these examples, Ensemble Floating Point Scale Add instructions have been labeled as “EnsembleTernaryFloatingPoint.” As shown in FIGS. 38H-I, in this exemplary embodiment, the contents of registers rd and rc are taken to represent a group of floating-point operands. Operands from register rd are multiplied with a floating-point operand taken from the least-significant bits of the contents of register rb and added to operands from register rc multiplied with a floating-point operand taken from the next least-significant bits of the contents of register rb. The results are concatenated and placed in register ra. In an exemplary embodiment, the results are rounded to the nearest representable floating-point value in a single floating-point operation. In an exemplary embodiment, floating-point exceptions are not raised, and are handled according to the default rules of IEEE 754. In an exemplary embodiment, these instructions cannot select a directed rounding mode or trap on inexact.
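The scale-add behavior described above may be sketched as follows. The sketch assumes single-precision elements in 128-bit registers modeled as Python integers, with the two scale factors taken from the two lowest 32-bit elements of rb; the function and helper names are hypothetical. The single-rounding property stated above is not reproduced, since the host evaluates each product and the sum separately.

```python
import struct

def unpack_f32(reg):
    """Split a 128-bit integer into four float32 elements (element 0 = low-order bits)."""
    return list(struct.unpack("<4f", reg.to_bytes(16, "little")))

def pack_f32(elems):
    """Catenate four float32 elements into one 128-bit integer."""
    return int.from_bytes(struct.pack("<4f", *elems), "little")

def ensemble_scale_add_f32(rd, rc, rb):
    """Model ra <- rd * rb[0] + rc * rb[1] for each 32-bit partition."""
    d, c, b = unpack_f32(rd), unpack_f32(rc), unpack_f32(rb)
    scale_d, scale_c = b[0], b[1]  # least-significant and next least-significant operands
    return pack_f32([di * scale_d + ci * scale_c for di, ci in zip(d, c)])

rd = pack_f32([1.0, 2.0, 3.0, 4.0])
rc = pack_f32([10.0, 20.0, 30.0, 40.0])
rb = pack_f32([2.0, 0.5, 0.0, 0.0])
ra = unpack_f32(ensemble_scale_add_f32(rd, rc, rb))
print(ra)  # [7.0, 14.0, 21.0, 28.0]
```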

Group Set Floating-Point

In accordance with one embodiment of the invention, these operations take two values from registers, perform a group of floating-point arithmetic operations on partitions of bits in the operands, and place the concatenated results in a register. The contents of registers ra and rb are combined using the specified floating-point operation. The result is placed in register rc. The operation is rounded using the specified rounding option or using round-to-nearest if not specified. If a rounding option is specified, the operation raises a floating-point exception if a floating-point invalid operation, divide by zero, overflow, or underflow occurs, or when specified, if the result is inexact. If a rounding option is not specified, floating-point exceptions are not raised, and are handled according to the default rules of IEEE 754.

FIG. 39D also presents various examples of Group Set Floating-point instructions accommodating different operand sizes. FIG. 39E also presents additional pseudo-instructions which are equivalent to other Group Set Floating-Point instructions according to the mapping rules further presented in FIG. 39E. FIGS. 39F and 39G illustrate an exemplary embodiment of a format and operation codes that can be used to perform the various Group Set instructions. As shown in FIG. 39G, in this exemplary embodiment, the contents of registers rc and rb are partitioned into groups of operands of the size specified and the specified comparisons are performed, each producing a Boolean result repeated to the size specified, yielding a group of results, each of which is the size specified. The group of results is catenated and placed in register rd. If a rounding mode is specified, a floating-point exception is raised if any operand is a SNAN, or when performing a Less or Greater Equal comparison, any operand is a QNAN. If a rounding option is not specified, floating-point exceptions are not raised, and are handled according to the default rules of IEEE 754.
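The production of a Boolean result repeated to the element size may be illustrated with the following sketch of a less-than comparison on single-precision elements. The function name group_set_less_f32 is hypothetical, the operand order (rc compared against rb) is an assumption, and the NaN exception behavior described above is not modeled.

```python
import struct

def group_set_less_f32(rc, rb):
    """Return a 128-bit mask: each 32-bit element is all ones where rc < rb, else zero."""
    c = struct.unpack("<4f", rc.to_bytes(16, "little"))
    b = struct.unpack("<4f", rb.to_bytes(16, "little"))
    result = 0
    for i, (ci, bi) in enumerate(zip(c, b)):
        if ci < bi:
            result |= 0xFFFFFFFF << (32 * i)  # Boolean true repeated across the element
    return result

rc = int.from_bytes(struct.pack("<4f", 1.0, 5.0, -2.0, 3.0), "little")
rb = int.from_bytes(struct.pack("<4f", 2.0, 4.0, -1.0, 3.0), "little")
mask = group_set_less_f32(rc, rb)
print(f"{mask:032x}")  # 00000000ffffffff00000000ffffffff
```

A mask of this form can be combined with bitwise operations to select between two groups of operands without branching.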

Group Compare Floating-Point

FIG. 40A presents various examples of Group Compare Floating-point instructions accommodating different operand sizes. FIGS. 40B and 40C illustrate an exemplary embodiment of a format and operational codes that can be used to perform the various Group Compare Floating-point instructions. As shown in FIGS. 40B and 40C, in this exemplary embodiment, these operations perform calculations on partitions of bits in two general register values, and generate a floating-point arithmetic exception if the condition specified is met. The contents of registers rd and rc are compared using the specified floating-point condition. If the result of the comparison is true for any corresponding pair of elements, a floating-point exception is raised. If a rounding option is specified, the operation raises a floating-point exception if a floating-point invalid operation occurs. If a rounding option is not specified, floating-point exceptions are not raised, and are handled according to the default rules of IEEE 754.

Ensemble Unary Floating-Point

FIG. 41A presents various examples of Ensemble Unary Floating-point instructions accommodating different operand sizes. FIGS. 41B and 41C illustrate an exemplary embodiment of a format and operational codes that can be used to perform the various Ensemble Unary Floating-point instructions. As shown in FIGS. 41B and 41C, in this exemplary embodiment, these operations take one value from a register, perform a group of floating-point arithmetic operations on partitions of bits in the operands, and place the concatenated results in a register. The contents of register rc are used as the operand of the specified floating-point operation. The result is placed in register rd. The operation is rounded using the specified rounding option or using round-to-nearest if not specified. If a rounding option is specified, unless default exception handling is specified, the operation raises a floating-point exception if a floating-point invalid operation, divide by zero, overflow, or underflow occurs, or when specified, if the result is inexact. If a rounding option is not specified or if default exception handling is specified, floating-point exceptions are not raised, and are handled according to the default rules of IEEE 754. The reciprocal estimate and reciprocal square root estimate instructions compute an exact result for half precision, and a result with at least 12 bits of significant precision for larger formats.

Ensemble Multiply Galois Field

In accordance with one embodiment of the invention, the processor handles different Galois field operations. For example, FIG. 42A presents various examples of Ensemble Multiply Galois Field instructions accommodating different operand sizes. FIGS. 42B and 42C illustrate an exemplary embodiment of a format and operation codes that can be used to perform the Ensemble Multiply Galois Field instructions shown in FIG. 42A. As shown in FIGS. 42B and 42C, in this exemplary embodiment, the contents of registers rd, rc, and rb are fetched. The specified operation is performed on these operands. The result is placed into register ra.

The contents of registers rd and rc are partitioned into groups of operands of the size specified and multiplied in the manner of polynomials. The group of values is reduced modulo the polynomial specified by the contents of register rb, yielding a group of results, each of which is the size specified. The group of results is catenated and placed in register ra.

An ensemble multiply Galois field bytes instruction (E.MULG.8) multiplies operand [d15 d14 d13 d12 d11 d10 d9 d8 d7 d6 d5 d4 d3 d2 d1 d0] by operand [c15 c14 c13 c12 c11 c10 c9 c8 c7 c6 c5 c4 c3 c2 c1 c0], modulo polynomial [q], yielding the results [(d15c15 mod q) (d14c14 mod q) . . . (d0c0 mod q)], as illustrated in FIG. 42D.
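The byte-wise polynomial multiplication and reduction described above may be sketched as follows. The names gf_mul and ensemble_mul_galois_8 are hypothetical, and the sketch assumes that the modulus q is supplied as an integer that includes its leading x^8 term; how the polynomial is actually encoded in register rb is not stated in the text above. The check values use the well-known polynomial 0x11B.

```python
def gf_mul(a, b, q, size=8):
    """Multiply a and b as polynomials over GF(2), then reduce modulo q.

    q is assumed to include its leading x**size term (e.g. 0x11B for size 8).
    """
    product = 0
    for bit in range(size):
        if (b >> bit) & 1:
            product ^= a << bit          # carry-less partial product
    for deg in range(2 * size - 2, size - 1, -1):
        if (product >> deg) & 1:
            product ^= q << (deg - size)  # cancel the term of degree deg
    return product

def ensemble_mul_galois_8(rd, rc, q):
    """Apply gf_mul to each of the 16 byte pairs of two 128-bit values."""
    result = 0
    for i in range(16):
        d = (rd >> (8 * i)) & 0xFF
        c = (rc >> (8 * i)) & 0xFF
        result |= gf_mul(d, c, q) << (8 * i)
    return result

print(hex(gf_mul(0x57, 0x83, 0x11B)))                    # 0xc1
print(hex(ensemble_mul_galois_8(0x5757, 0x1383, 0x11B)))  # 0xfec1
```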

Compress, Expand, Rotate and Shift

In accordance with one embodiment of the invention, these operations take operands from two registers, perform operations on partitions of bits in the operands, and place the concatenated results in a third register. Two values are taken from the contents of registers rc and rb. The specified operation is performed, and the result is placed in register rd.

In one embodiment of the invention, crossbar switch units such as units 142 and 148 perform data handling operations, as previously discussed. As shown in FIG. 43A, such data handling operations may include various examples of Crossbar Compress, Crossbar Expand, Crossbar Rotate, and Crossbar Shift operations. FIGS. 43B and 43C illustrate an exemplary embodiment of a format and operation codes that can be used to perform the various Crossbar Compress, Crossbar Expand, Crossbar Rotate, and Crossbar Shift instructions. As shown in FIGS. 43B and 43C, in this exemplary embodiment, the contents of registers rc and rb are obtained and the contents of register rc are partitioned into groups of operands of the size specified and the specified operation is performed using a shift amount obtained from the contents of register rb masked to values from zero to one less than the size specified, yielding a group of results. The group of results is catenated and placed in register rd.

Various Group Compress operations may convert groups of operands from higher precision data to lower precision data. An arbitrary half-sized sub-field of each bit field can be selected to appear in the result. For example, FIG. 43D shows an X.COMPRESS.16 rd=rc,4 operation, which performs a selection of bits 19 . . . 4 of each quadlet in a hexlet. Various Group Shift operations may allow shifting of groups of operands by a specified number of bits, in a specified direction, such as shift right or shift left. As can be seen in FIG. 43C, certain Group Shift Left instructions may also involve clearing (to zero) empty low order bits associated with the shift, for each operand. Certain Group Shift Right instructions may involve clearing (to zero) empty high order bits associated with the shift, for each operand. Further, certain Group Shift Right instructions may involve filling empty high order bits associated with the shift with copies of the sign bit, for each operand.
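The X.COMPRESS.16 rd=rc,4 example above, which selects bits 19 . . . 4 of each quadlet, may be sketched as follows. The function name crossbar_compress is hypothetical, registers are modeled as Python integers, and placing the four 16-bit results in the low-order 64 bits of the result is an assumption of this sketch.

```python
def crossbar_compress(rc, out_size, shift, width=128):
    """Select an out_size-bit sub-field, starting at bit `shift`, from each
    element of size 2*out_size, and catenate the selected sub-fields."""
    in_size = 2 * out_size
    in_mask = (1 << in_size) - 1
    out_mask = (1 << out_size) - 1
    result = 0
    for i in range(width // in_size):
        element = (rc >> (in_size * i)) & in_mask
        result |= ((element >> shift) & out_mask) << (out_size * i)
    return result

# Four quadlets, written most-significant first.
rc = 0x0009ABC0_FFF5678F_00012340_000ABCD0
rd = crossbar_compress(rc, 16, 4)
print(hex(rd))  # 0x9abc56781234abcd
```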

Shift Merge

In accordance with one embodiment of the invention, these operations take operands from three registers, perform operations on partitions of bits in the operands, and place the concatenated results in the third register. The contents of registers rd, rc and rb are fetched. The specified operation is performed on these operands. The result is placed into register rd.

In one embodiment of the invention, as shown in FIG. 43E, such data handling operations may also include various examples of Shift Merge operations. FIGS. 43F and 43G illustrate an exemplary embodiment of a format and operation codes that can be used to perform the various Shift Merge instructions. As shown in FIGS. 43F and 43G, in this exemplary embodiment, the contents of registers rd and rc are obtained and are partitioned into groups of operands of the size specified, and the specified operation is performed using a shift amount obtained from the contents of register rb masked to values from zero to one less than the size specified, yielding a group of results. The group of results is catenated and placed in register rd. Register rd is both a source and destination of this instruction.

Shift Merge operations may allow shifting of groups of operands by a specified number of bits, in a specified direction, such as shift right or shift left. As can be seen in FIG. 43G, certain Shift Merge operations may involve filling empty bits associated with the shift with copies of corresponding bits from the contents of register rd, for each operand.
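A shift-left-merge of the kind described above may be sketched as follows. The function name shift_left_merge is hypothetical, and the sketch assumes that a single shift amount is taken from the low-order bits of rb and applied to every element; the exact source of the shift amount within rb is defined by the figures and is not reproduced here.

```python
def shift_left_merge(rd, rc, rb, size, width=128):
    """Shift each size-bit element of rc left, filling the vacated low-order
    bits with the corresponding bits of the same element of rd."""
    amount = rb & (size - 1)        # shift amount masked to 0 .. size-1
    mask = (1 << size) - 1
    low = (1 << amount) - 1         # bit positions vacated by the shift
    result = 0
    for i in range(width // size):
        d = (rd >> (size * i)) & mask
        c = (rc >> (size * i)) & mask
        merged = ((c << amount) & mask) | (d & low)
        result |= merged << (size * i)
    return result

# Two 16-bit elements; rb = 20 is masked to a shift amount of 4.
merged = shift_left_merge(0xAAAA5555, 0x12340FF0, 20, 16, width=32)
print(hex(merged))  # 0x234aff05
```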

Compress, Expand, Rotate and Shift Immediate

In accordance with one embodiment of the invention, these operations take operands from a register and a short immediate value, perform operations on partitions of bits in the operands, and place the concatenated results in a register. A 128-bit value is taken from the contents of register rc. The second operand is taken from simm. The specified operation is performed, and the result is placed in register rd.

In one embodiment of the invention, crossbar switch units such as units 142 and 148 perform data handling operations, as previously discussed. As shown in FIG. 43H, such data handling operations may include various examples of Crossbar Compress Immediate, Crossbar Expand Immediate, Crossbar Rotate Immediate, and Crossbar Shift Immediate operations. FIGS. 43I and 43J illustrate an exemplary embodiment of a format and operation codes that can be used to perform the various Crossbar Compress Immediate, Crossbar Expand Immediate, Crossbar Rotate Immediate, and Crossbar Shift Immediate instructions. As shown in FIGS. 43I and 43J, in this exemplary embodiment, the contents of register rc are obtained and are partitioned into groups of operands of the size specified and the specified operation is performed using a shift amount obtained from the instruction masked to values from zero to one less than the size specified, yielding a group of results. The group of results is catenated and placed in register rd.

Various Group Compress Immediate operations may convert groups of operands from higher precision data to lower precision data. An arbitrary half-sized sub-field of each bit field can be selected to appear in the result. For example, FIG. 43D shows an X.COMPRESS.16 rd=rc,4 operation, which performs a selection of bits 19 . . . 4 of each quadlet in a hexlet. Various Group Shift Immediate operations may allow shifting of groups of operands by a specified number of bits, in a specified direction, such as shift right or shift left. As can be seen in FIG. 43J, certain Group Shift Left Immediate instructions may also involve clearing (to zero) empty low order bits associated with the shift, for each operand. Certain Group Shift Right Immediate instructions may involve clearing (to zero) empty high order bits associated with the shift, for each operand. Further, certain Group Shift Right Immediate instructions may involve filling empty high order bits associated with the shift with copies of the sign bit, for each operand.

Shift Merge Immediate

In accordance with one embodiment of the invention, these operations take operands from two registers and a short immediate value, perform operations on partitions of bits in the operands, and place the concatenated results in the second register. Two 128-bit values are taken from the contents of registers rd and rc. A third operand is taken from simm. The specified operation is performed, and the result is placed in register rd. This instruction is undefined and causes a reserved instruction exception if the simm field is greater or equal to the size specified.

In one embodiment of the invention, as shown in FIG. 43K, such data handling operations may also include various examples of Shift Merge Immediate operations. FIGS. 43L and 43M illustrate an exemplary embodiment of a format and operation codes that can be used to perform the various Shift Merge Immediate instructions. As shown in FIGS. 43L and 43M, in this exemplary embodiment, the contents of registers rd and rc are obtained and are partitioned into groups of operands of the size specified, and the specified operation is performed using a shift amount obtained from the instruction masked to values from zero to one less than the size specified, yielding a group of results. The group of results is catenated and placed in register rd. Register rd is both a source and destination of this instruction.

Shift Merge operations may allow shifting of groups of operands by a specified number of bits, in a specified direction, such as shift right or shift left. As can be seen in FIG. 43G, certain Shift Merge operations may involve filling empty bits associated with the shift with copies of corresponding bits from the contents of register rd, for each operand.

Crossbar Extract

In one embodiment of the invention, data handling operations may also include a Crossbar Extract instruction. These operations take operands from three registers, perform operations on partitions of bits in the operands, and place the concatenated results in a fourth register. FIGS. 44A and 44B illustrate an exemplary embodiment of a format and operation codes that can be used to perform the Crossbar Extract instruction. As shown in FIGS. 44A and 44B, in this exemplary embodiment, the contents of registers rd, rc, and rb are fetched. The specified operation is performed on these operands. The result is placed into register ra.

The Crossbar Extract instruction allows bits to be extracted from different operands in various ways. Specifically, bits 31 . . . 0 of the contents of register rb specify several parameters which control the manner in which data is extracted, and for certain operations, the manner in which the operation is performed. The position of the control fields allows for the source position to be added to a fixed control value for dynamic computation, and allows for the lower 16 bits of the control field to be set for some of the simpler extract cases by a single GCOPYI.128 instruction. The control fields are further arranged so that if only the low order 8 bits are non-zero, a 128-bit extraction with truncation and no rounding is performed:

The table below describes the meaning of each label:

label  bits  meaning
fsize  8     field size
dpos   8     destination position
x      1     reserved
s      1     signed vs. unsigned
n      1     reserved
m      1     merge vs. extract
l      1     reserved
rnd    2     reserved
gssp   9     group size and source position

The 9-bit gssp field encodes both the group size, gsize, and source position, spos, according to the formula gssp=512−4*gsize+spos. The group size, gsize, is a power of two in the range 1 . . . 128. The source position, spos, is in the range 0 . . . (2*gsize)−1.
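The gssp encoding above can be checked with a short sketch. The function names encode_gssp and decode_gssp are hypothetical. Decoding relies only on the stated formula and ranges: each group size occupies its own non-overlapping interval of gssp values, so the group size can be recovered by testing each candidate.

```python
GROUP_SIZES = (128, 64, 32, 16, 8, 4, 2, 1)

def encode_gssp(gsize, spos):
    """Encode group size and source position as gssp = 512 - 4*gsize + spos."""
    if gsize not in GROUP_SIZES:
        raise ValueError("gsize must be a power of two in 1..128")
    if not 0 <= spos < 2 * gsize:
        raise ValueError("spos must be in 0..(2*gsize)-1")
    return 512 - 4 * gsize + spos

def decode_gssp(gssp):
    """Recover (gsize, spos) from a 9-bit gssp value."""
    for gsize in GROUP_SIZES:
        spos = gssp - (512 - 4 * gsize)
        if 0 <= spos < 2 * gsize:
            return gsize, spos
    raise ValueError("gssp does not encode a legal gsize/spos pair")

print(encode_gssp(16, 5))   # 453
print(decode_gssp(453))     # (16, 5)
```

With gsize equal to 128, gssp equals spos and fits in 8 bits, which is consistent with the statement above that a 128-bit extraction results when only the low-order 8 bits of the control field are non-zero.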

The values in the s, n, m, l, and rnd fields have the following meaning:

values  s         n  m        l  rnd
0       unsigned     extract
1       signed       merge
2
3

For the E.SCAL.ADD.X instruction, bits 127..64 of the contents of register rc specify the multipliers for the multiplicands in registers ra and rb. Specifically, bits 64+2*gsize−1..64+gsize are the multiplier for the contents of register ra, and bits 64+gsize−1..64 are the multiplier for the contents of register rb.

As shown in FIG. 44C, for the X.EXTRACT instruction, when m=0, the parameters are interpreted to select fields from the catenated contents of registers rd and rc, extracting values which are catenated and placed in register ra. As shown in FIG. 44D, for a crossbar-merge-extract (X.EXTRACT when m=1), the parameters are interpreted to merge fields from the contents of register rd with the contents of register rc. The results are catenated and placed in register ra.

Ensemble Extract

In one embodiment of the invention, data handling operations may also include an Ensemble Extract instruction. These operations take operands from three registers, perform operations on partitions of bits in the operands, and place the concatenated results in a fourth register. FIGS. 44E, 44F and 44G illustrate an exemplary embodiment of a format and operation codes that can be used to perform the Ensemble Extract instruction. As shown in FIGS. 44F and 44G, in this exemplary embodiment, the contents of registers rd, rc, and rb are fetched. The specified operation is performed on these operands. The result is placed into register ra.

The Crossbar Extract instruction allows bits to be extracted from different operands in various ways. Specifically, bits 31 . . . 0 of the contents of register rb specify several parameters which control the manner in which data is extracted, and for certain operations, the manner in which the operation is performed. The position of the control fields allows for the source position to be added to a fixed control value for dynamic computation, and allows for the lower 16 bits of the control field to be set for some of the simpler extract cases by a single GCOPYI.128 instruction. The control fields are further arranged so that if only the low order 8 bits are non-zero, a 128-bit extraction with truncation and no rounding is performed:

The table below describes the meaning of each label:

label  bits  meaning
fsize  8     field size
dpos   8     destination position
x      1     reserved
s      1     signed vs. unsigned
n      1     complex vs. real multiplication
m      1     merge vs. extract or mixed-sign vs. same-sign multiplication
l      1     limit: saturation vs. truncation
rnd    2     rounding
gssp   9     group size and source position

The 9-bit gssp field encodes both the group size, gsize, and source position, spos, according to the formula gssp=512−4*gsize+spos. The group size, gsize, is a power of two in the range 1 . . . 128. The source position, spos, is in the range 0 . . . (2*gsize)−1.

The values in the s, n, m, l, and rnd fields have the following meaning:

values  s         n        m                  l         rnd
0       unsigned  real     extract/same-sign  truncate  F
1       signed    complex  merge/mixed-sign   saturate  Z
2                                                       N
3                                                       C

As shown in FIG. 44H, an ensemble-multiply-extract-doublets instruction (E.MUL.X) multiplies vector ra [h g f e d c b a] with vector rb [p o n m l k j i], yielding the result [hp go fn em dl ck bj ai], rounded and limited as specified by rc_(31..0).

As shown in FIG. 44I, an ensemble-multiply-extract-doublets-complex instruction (E.MUL.X with n set) multiplies operand [h g f e d c b a] by operand [p o n m l k j i], yielding the result [gp+ho go−hp en+fm em−fn cl+dk ck−dl aj+bi ai−bj], rounded and limited as specified. Note that this instruction prefers an organization of complex numbers in which the real part is located to the right (lower precision) of the imaginary part.
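The element pattern of the complex multiplication above can be illustrated with the following sketch. It models only the products and sums, for example ai−bj and aj+bi for the lowest pair, using Python integers listed with the least-significant element first; the extraction, rounding and limiting steps are omitted, and the function name ensemble_mul_complex is hypothetical.

```python
def ensemble_mul_complex(x, y):
    """Multiply pairs of elements as complex numbers.

    x and y list elements least-significant first, so each pair is
    (real, imaginary): the real part sits to the right of the imaginary part.
    """
    out = []
    for k in range(0, len(x), 2):
        a, b = x[k], x[k + 1]      # a = real part, b = imaginary part
        i, j = y[k], y[k + 1]      # i = real part, j = imaginary part
        out += [a * i - b * j, a * j + b * i]
    return out

# (1+2i)*(5+6i) = -7+16i and (3+4i)*(7+8i) = -11+52i
product = ensemble_mul_complex([1, 2, 3, 4], [5, 6, 7, 8])
print(product)  # [-7, 16, -11, 52]
```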

As shown in FIG. 44J, an ensemble-scale-add-extract-doublets instruction (E.SCAL.ADD.X) multiplies vector ra [h g f e d c b a] with rc_(95..80) [r] and adds the product to the product of vector rb [p o n m l k j i] with rc_(79..64) [q], yielding the result [hr+pq gr+oq fr+nq er+mq dr+lq cr+kq br+jq ar+iq], rounded and limited as specified by rc_(31..0).

As shown in FIG. 44K, an ensemble-scale-add-extract-doublets-complex instruction (E.SCAL.ADD.X with n set) multiplies vector ra [h g f e d c b a] with rc_(127..96) [t s] and adds the product to the product of vector rb [p o n m l k j i] with rc_(95..64) [r q], yielding the result [hs+gt+pq+or gs−ht+oq−pr fs+et+nq+mr es−ft+mq−nr ds+ct+lq+kr cs−dt+kq−lr bs+at+jq+ir as−bt+iq−jr], rounded and limited as specified by rc_(31..0).

As shown in FIG. 44C, for the E.EXTRACT instruction, when m=0, the parameters are interpreted to select fields from the catenated contents of registers rd and rc, extracting values which are catenated and placed in register ra. As shown in FIG. 44D, for an ensemble-merge-extract (E.EXTRACT when m=1), the parameters are interpreted to merge fields from the contents of register rd with the contents of register rc. The results are catenated and placed in register ra. As can be seen from FIG. 44G, the operand portion to the left of the selected field is treated as signed or unsigned as controlled by the s field, and truncated or saturated as controlled by the l field, while the operand portion to the right of the selected field is rounded as controlled by the rnd field.

Deposit and Withdraw

As shown in FIG. 45A, in one embodiment of the invention, data handling operations include various Deposit and Withdraw instructions. FIGS. 45B and 45C illustrate an exemplary embodiment of a format and operation codes that can be used to perform the various Deposit and Withdraw instructions. As shown in FIGS. 45B and 45C, in this exemplary embodiment, these operations take operands from a register and two immediate values, perform operations on partitions of bits in the operands, and place the concatenated results in the second register. Specifically, the contents of register rc are fetched, and 7-bit immediate values are taken from the 2-bit ih and the 6-bit gsfp and gsfs fields. The specified operation is performed on these operands. The result is placed into register rd.

FIG. 45D shows legal values for the ih, gsfp and gsfs fields, indicating the group size to which they apply. The ih, gsfp and gsfs fields encode three values: the group size, the field size, and a shift amount. The shift amount can also be considered to be the source bit field position for group-withdraw instructions or the destination bit field position for group-deposit instructions. The encoding is designed so that combining the gsfp and gsfs fields with a bitwise-and produces a result which can be decoded to the group size, and so the field size and shift amount can be easily decoded once the group size has been determined.

As shown in FIG. 45E, the crossbar-deposit instructions deposit a bit field from the lower bits of each group partition of the source to a specified bit position in the result. The value is either sign-extended or zero-extended, as specified. As shown in FIG. 45F, the crossbar-withdraw instructions withdraw a bit field from a specified bit position in each group partition of the source and place it in the lower bits in the result. The value is either sign-extended or zero-extended, as specified.
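The deposit and withdraw behavior described above may be sketched as follows, taking the already-decoded group size, field size and shift amount as plain arguments rather than decoding them from the ih, gsfp and gsfs fields. The function names are hypothetical, and the sketch assumes that a deposit leaves zeros in the bit positions below the deposited field.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def crossbar_withdraw(rc, gsize, fsize, pos, signed=False, width=128):
    """Take an fsize-bit field at bit `pos` of each group; place it in the low bits."""
    gmask = (1 << gsize) - 1
    result = 0
    for g in range(width // gsize):
        group = (rc >> (gsize * g)) & gmask
        field = (group >> pos) & ((1 << fsize) - 1)
        if signed:
            field = sign_extend(field, fsize) & gmask
        result |= field << (gsize * g)
    return result

def crossbar_deposit(rc, gsize, fsize, pos, signed=False, width=128):
    """Take the low fsize bits of each group; place them at bit `pos` of the result."""
    gmask = (1 << gsize) - 1
    result = 0
    for g in range(width // gsize):
        group = (rc >> (gsize * g)) & gmask
        field = group & ((1 << fsize) - 1)
        if signed:
            field = sign_extend(field, fsize)
        result |= ((field << pos) & gmask) << (gsize * g)
    return result

# Two 16-bit groups, 8-bit field, position 4.
print(hex(crossbar_withdraw(0x0AB01F80, 16, 8, 4, width=32)))              # 0xab00f8
print(hex(crossbar_withdraw(0x0AB01F80, 16, 8, 4, signed=True, width=32)))  # 0xffabfff8
print(hex(crossbar_deposit(0x00F11234, 16, 8, 4, signed=True, width=32)))   # 0xff100340
```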

Deposit Merge

As shown in FIG. 45G, in one embodiment of the invention, data handling operations include various Deposit Merge instructions. FIGS. 45H and 45I illustrate an exemplary embodiment of a format and operation codes that can be used to perform the various Deposit Merge instructions. As shown in FIGS. 45H and 45I, in this exemplary embodiment, these operations take operands from two registers and two immediate values, perform operations on partitions of bits in the operands, and place the concatenated results in the second register. Specifically, the contents of registers rc and rd are fetched, and 7-bit immediate values are taken from the 2-bit ih and the 6-bit gsfp and gsfs fields. The specified operation is performed on these operands. The result is placed into register rd.

FIG. 45D shows legal values for the ih, gsfp and gsfs fields, indicating the group size to which they apply. The ih, gsfp and gsfs fields encode three values: the group size, the field size, and a shift amount. The shift amount can also be considered to be the source bit field position for group-withdraw instructions or the destination bit field position for group-deposit instructions. The encoding is designed so that combining the gsfp and gsfs fields with a bitwise-and produces a result which can be decoded to the group size, and so the field size and shift amount can be easily decoded once the group size has been determined.

As shown in FIG. 45J, the crossbar-deposit-merge instructions deposit a bit field from the lower bits of each group partition of the source to a specified bit position in the result. The value is merged with the contents of register rd at bit positions above and below the deposited bit field. No sign- or zero-extension is performed by this instruction.
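The deposit-merge behavior described above may be sketched as follows, again with the group size, field size and bit position supplied as already-decoded arguments. The function name crossbar_deposit_merge is hypothetical.

```python
def crossbar_deposit_merge(rd, rc, gsize, fsize, pos, width=128):
    """Deposit the low fsize bits of each group of rc at bit `pos`, keeping the
    bits of rd above and below the deposited field."""
    gmask = (1 << gsize) - 1
    fmask = (((1 << fsize) - 1) << pos) & gmask   # positions of the deposited field
    result = 0
    for g in range(width // gsize):
        d = (rd >> (gsize * g)) & gmask
        c = (rc >> (gsize * g)) & gmask
        merged = (d & ~fmask) | ((c << pos) & fmask)
        result |= merged << (gsize * g)
    return result

# Two 16-bit groups, 8-bit field deposited at position 4.
merged = crossbar_deposit_merge(0xFFFF0000, 0x00F11234, 16, 8, 4, width=32)
print(hex(merged))  # 0xff1f0340
```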

Shuffle

In accordance with one embodiment of the invention, these operations take operands from two registers, perform operations on partitions of bits in the operands, and place the concatenated results in a register.

As shown in FIG. 46A, in one embodiment of the invention, data handling operations may also include various Shuffle instructions, which allow the contents of registers to be partitioned into groups of operands and interleaved in a variety of ways. FIGS. 46B and 46C illustrate an exemplary embodiment of a format and operation codes that can be used to perform the various Shuffle instructions. As shown in FIGS. 46B and 46C, in this exemplary embodiment, one of two operations is performed, depending on whether the rc and rb fields are equal. Also, FIG. 46B and the description below illustrate the format of and relationship of the rd, rc, rb, op, v, w, h, and size fields.

In the present embodiment, if the rc and rb fields are equal, a 128-bit operand is taken from the contents of register rc. Items of size v are divided into w piles and shuffled together, within groups of size bits, according to the value of op. The result is placed in register rd.

FIG. 46C illustrates that for this operation, values of three parameters x, y, and z are computed depending on the value of op, and in each result bit position i, a source bit position within the contents of register rc is selected, wherein the source bit position is the catenation of four fields, the first and fourth fields containing fields of i which are unchanged: 6 . . . x and y−1 . . . 0, and the second and third fields containing a subfield of i, bits x−1 . . . y which is rotated by an amount z: y+z−1 . . . y and x−1 . . . y+z.

Further, if the rc and rb fields are not equal, the contents of registers rc and rb are catenated into a 256-bit operand. Items of size v are divided into w piles and shuffled together, according to the value of op. Depending on the value of h, a sub-field of op, the low 128 bits (h=0), or the high 128 bits (h=1) of the 256-bit shuffled contents are selected as the result. The result is placed in register rd.

This instruction is undefined and causes a reserved instruction exception if rc and rb are equal and the op field is greater or equal to 56, or if rc and rb are not equal and op_(4..0) is greater or equal to 28.

FIG. 46C illustrates that for this operation, the value of x is fixed, and values of two parameters y and z are computed depending on the value of op, and in each result bit position i, a source bit position within the contents of register rc is selected, wherein the source bit position is the catenation of three fields, the first field containing a field of i which is unchanged: y−1 . . . 0, and the second and third fields containing a subfield of i, bits x−1 . . . y which is rotated by an amount z: y+z−1 . . . y and x−1 . . . y+z.

As shown in FIG. 46D, an example of a crossbar 4-way shuffle of bytes within hexlet instruction (X.SHUFFLE.128 rd=rcb,8,4) divides the 128-bit operand into 16 bytes and partitions the bytes 4 ways (indicated by varying shade in the diagram below). The 4 partitions are perfectly shuffled, producing a 128-bit result. As shown in FIG. 46E, an example of a crossbar 4-way shuffle of bytes within triclet instruction (X.SHUFFLE.256 rd=rc,rb,8,4,0) catenates the contents of rc and rb, then divides the 256-bit content into 32 bytes and partitions the bytes 4 ways (indicated by varying shade in the diagram below). The low-order halves of the 4 partitions are perfectly shuffled, producing a 128-bit result.

Changing the last immediate value h to 1 (X.SHUFFLE.256 rd=rc,rb,8,4,1) modifies the operation to perform the same function on the high-order halves of the 4 partitions. When rc and rb are equal, the table below shows the value of the op field and associated values for size, v, and w.

op  size  v   w
0   4     1   2
1   8     1   2
2   8     2   2
3   8     1   4
4   16    1   2
5   16    2   2
6   16    4   2
7   16    1   4
8   16    2   4
9   16    1   8
10  32    1   2
11  32    2   2
12  32    4   2
13  32    8   2
14  32    1   4
15  32    2   4
16  32    4   4
17  32    1   8
18  32    2   8
19  32    1   16
20  64    1   2
21  64    2   2
22  64    4   2
23  64    8   2
24  64    16  2
25  64    1   4
26  64    2   4
27  64    4   4
28  64    8   4
29  64    1   8
30  64    2   8
31  64    4   8
32  64    1   16
33  64    2   16
34  64    1   32
35  128   1   2
36  128   2   2
37  128   4   2
38  128   8   2
39  128   16  2
40  128   32  2
41  128   1   4
42  128   2   4
43  128   4   4
44  128   8   4
45  128   16  4
46  128   1   8
47  128   2   8
48  128   4   8
49  128   8   8
50  128   1   16
51  128   2   16
52  128   4   16
53  128   1   32
54  128   2   32
55  128   1   64

When rc and rb are not equal, the table below shows the value of the op_(4..0) field and associated values for size, v, and w. Op_(5) is the value of h, which controls whether the low-order or high-order half of each partition is shuffled into the result.

op_(4..0)  size  v   w
0          256   1   2
1          256   2   2
2          256   4   2
3          256   8   2
4          256   16  2
5          256   32  2
6          256   64  2
7          256   1   4
8          256   2   4
9          256   4   4
10         256   8   4
11         256   16  4
12         256   32  4
13         256   1   8
14         256   2   8
15         256   4   8
16         256   8   8
17         256   16  8
18         256   1   16
19         256   2   16
20         256   4   16
21         256   8   16
22         256   1   32
23         256   2   32
24         256   4   32
25         256   1   64
26         256   2   64
27         256   1   128
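The bit-index rotation described for the 128-bit shuffle may be sketched as follows. The sketch assumes that the parameters are x = log2(size), y = log2(v) and z = log2(w); the text states only that they are computed from op as shown in FIG. 46C, so this mapping is an assumption, as is the function name shuffle128. Under that assumption the sketch reproduces the 4-way shuffle of bytes within a hexlet.

```python
def shuffle128(value, size, v, w):
    """Shuffle items of v bits, divided into w piles, within groups of `size` bits.

    Assumed parameters: x = log2(size), y = log2(v), z = log2(w).
    """
    x, y, z = size.bit_length() - 1, v.bit_length() - 1, w.bit_length() - 1
    n = x - y                                   # width of the rotated subfield
    result = 0
    for i in range(128):
        high = i >> x                           # bits 6..x of i, unchanged
        sub = (i >> y) & ((1 << n) - 1)         # bits x-1..y of i
        low = i & ((1 << y) - 1)                # bits y-1..0 of i, unchanged
        # source subfield = i[y+z-1..y] followed by i[x-1..y+z]
        rot = ((sub & ((1 << z) - 1)) << (n - z)) | (sub >> z)
        src = (high << x) | (rot << y) | low
        result |= ((value >> src) & 1) << i
    return result

# Byte k of the operand holds the value k, so the result shows the byte order.
operand = int.from_bytes(bytes(range(16)), "little")
shuffled = shuffle128(operand, 128, 8, 4)
print(list(shuffled.to_bytes(16, "little")))
# [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
```

The printed order shows the four piles of bytes (0..3, 4..7, 8..11, 12..15) interleaved one item at a time, which corresponds to a perfect 4-way shuffle.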

Swizzle

In accordance with one embodiment of the invention, these operationsperform calculations with a general register value and immediate values,placing the result in a general register.

In one embodiment of the invention, data handling operations may also include various Crossbar Swizzle instructions. FIGS. 47A and 47B illustrate an exemplary embodiment of a format and operation codes that can be used to perform Crossbar Swizzle instructions. As shown in FIGS. 47A and 47B, in this exemplary embodiment, the contents of register rc are fetched, and 7-bit immediate values, icopy and iswap, are constructed from the 2-bit ih field and from the 6-bit icopya and iswapa fields. The specified operation is performed on these operands. The result is placed into register rd.

The “swizzle” operation can reverse the order of the bit fields in a hexlet. For example, an X.SWIZZLE rd=rc,127,112 operation reverses the doublets within a hexlet, as shown in FIG. 47C. In some cases, it is desirable to use a group instruction in which one or more operands is a single value, not an array. The “swizzle” operation can also copy operands to multiple locations within a hexlet. For example, an X.SWIZZLE 15,0 operation copies the low-order 16 bits to each doublet within a hexlet.
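One bit-level reading that reproduces both examples above is sketched here in Python. It is an assumption rather than a definition taken from the figures: each result bit i is taken from source bit (i AND icopy) XOR iswap. The function name is hypothetical.

```python
def swizzle(c, icopy, iswap):
    """Assumed model: result bit i = source bit ((i & icopy) ^ iswap)."""
    result = 0
    for i in range(128):
        source_bit = (i & icopy) ^ iswap
        result |= ((c >> source_bit) & 1) << i
    return result
```

Under this reading, icopy=127 and iswap=112 invert bits 4 through 6 of every bit index, which reverses the order of the eight doublets, while icopy=15 and iswap=0 keep only the low four index bits, which replicates the low-order 16 bits into every doublet.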

Select

In accordance with one embodiment of the invention, these operations take three values from registers, perform a group of calculations on partitions of bits of the operands and place the catenated results in a fourth register. The contents of registers rd, rc, and rb are fetched. The specified operation is performed on these operands. The result is placed into register ra.

In one embodiment of the invention, data handling operations may also include various Crossbar Select instructions. FIGS. 47D and 47E illustrate an exemplary embodiment of a format and operation codes that can be used to perform Crossbar Select instructions. As shown in FIGS. 47D and 47E, in this exemplary embodiment, the contents of registers rd, rc and rb are fetched, and the contents of registers rd and rc are catenated, producing catenated data dc. The contents of register rb are partitioned into elements, and the value expressed in each partition is employed to select one partitioned element of the catenated data dc. The selected elements are catenated together, and the result is placed into register ra.
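A byte-granularity version of the select operation described above might look like the following Python sketch. The element size of 8 bits, the use of the low-order 5 bits of each rb byte as the index, and the placement of rd in the high-order half of dc are all assumptions made for illustration. The function name is hypothetical.

```python
def select_bytes(rd, rc, rb):
    """Pick 16 bytes out of the 32-byte catenation dc = rd || rc."""
    dc = (rd << 128) | rc  # assumption: rd is the high-order half
    result = 0
    for i in range(16):
        # Assumption: only the low 5 bits of each selector byte are used,
        # enough to address the 32 bytes of dc.
        index = (rb >> (8 * i)) & 0x1F
        element = (dc >> (8 * index)) & 0xFF
        result |= element << (8 * i)
    return result
```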

Load and Load Immediate

As shown in FIGS. 50A and 51A, in one embodiment of the invention, memory access operations may also include various Load and Load Immediate instructions. These figures and FIGS. 50B and 51B show that the various Load and Load Immediate instructions specify a type of operand, either signed or unsigned, represented by omitting or including a U, respectively. The instructions further specify a size of memory operand, byte, doublet, quadlet, octlet, or hexlet, representing 8, 16, 32, 64, and 128 bits respectively. The instructions further specify aligned memory operands, or not, represented by including an A, or with the A omitted, respectively. The instructions further specify a byte-ordering of the memory operand, either big-endian or little-endian, represented by B and L respectively.

Each instruction specifies the above items with the following exceptions: L.8, L.U8, L.I.8, L.I.U8 need not distinguish between little-endian and big-endian ordering, nor between aligned and unaligned, as only a single byte is loaded. L.128.B, L.128.AB, L.128.L, L.128.AL, L.I.128.B, L.I.128.AB, L.I.128.L, and L.I.128.AL need not distinguish between signed and unsigned, as the hexlet fills the destination register.

Regarding footnote 1 in FIG. 50A, L.8 need not distinguish between little-endian and big-endian ordering, nor between aligned and unaligned, as only a single byte is loaded.

Regarding footnote 2 in FIG. 50A, L.128.B need not distinguish between signed and unsigned, as the hexlet fills the destination register.

Regarding footnote 3 in FIG. 50A, L.128.AB need not distinguish between signed and unsigned, as the hexlet fills the destination register.

Regarding footnote 4 in FIG. 50A, L.128.L need not distinguish between signed and unsigned, as the hexlet fills the destination register.

Regarding footnote 5 in FIG. 50A, L.128.AL need not distinguish between signed and unsigned, as the hexlet fills the destination register.

Regarding footnote 6 in FIG. 50A, L.U8 need not distinguish between little-endian and big-endian ordering, nor between aligned and unaligned, as only a single byte is loaded.

Regarding footnote 1 in FIG. 51A, LI.8 need not distinguish between little-endian and big-endian ordering, nor between aligned and unaligned, as only a single byte is loaded.

Regarding footnote 2 in FIG. 51A, LI.128.AB need not distinguish between signed and unsigned, as the hexlet fills the destination register.

Regarding footnote 3 in FIG. 51A, LI.128.B need not distinguish between signed and unsigned, as the hexlet fills the destination register.

Regarding footnote 4 in FIG. 51A, LI.128.AL need not distinguish between signed and unsigned, as the hexlet fills the destination register.

Regarding footnote 5 in FIG. 51A, LI.128.L need not distinguish between signed and unsigned, as the hexlet fills the destination register.

Regarding footnote 6 in FIG. 51A, LI.U8 need not distinguish between little-endian and big-endian ordering, nor between aligned and unaligned, as only a single byte is loaded.

FIGS. 50B and 50C illustrate an exemplary embodiment of formats and operation codes that can be used to perform Load instructions. These operations compute a virtual address from the contents of two registers and load data from memory, sign- or zero-extending the data to fill the destination register. As shown in FIGS. 50B and 50C, in this exemplary embodiment, an operand size, expressed in bytes, is specified by the instruction. A virtual address is computed from the sum of the contents of register rc and the contents of register rb multiplied by operand size.

FIGS. 51B and 51C illustrate an exemplary embodiment of formats and operation codes that can be used to perform Load Immediate instructions. These operations compute a virtual address from the contents of a register and a sign-extended immediate value and load data from memory, sign- or zero-extending the data to fill the destination register. As shown in FIGS. 51B and 51C, in this exemplary embodiment, an operand size, expressed in bytes, is specified by the instruction. A virtual address is computed from the sum of the contents of register rc and the sign-extended value of the offset field, multiplied by the operand size.

In an exemplary embodiment, for both Load and Load Immediate instructions, the contents of memory using the specified byte order are read, treated as the size specified, zero-extended or sign-extended as specified, and placed into register rd. If alignment is specified, the computed virtual address must be aligned, that is, it must be an exact multiple of the size expressed in bytes. If the address is not aligned, an “access disallowed by virtual address” exception occurs.
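The load behavior described above can be illustrated with a small Python model. It treats memory as a flat byte sequence indexed by virtual address, assumes 64-bit address arithmetic and a 128-bit destination register, and uses a hypothetical AccessDisallowed class to stand in for the exception. For a Load Immediate, the sign-extended offset would be passed in place of the rb value.

```python
class AccessDisallowed(Exception):
    """Stand-in for the "access disallowed by virtual address" exception."""


def load(memory, rc, rb, size, signed=True, big_endian=False, aligned=False):
    """Return the 128-bit register value produced by a load of `size` bytes."""
    # Address = rc + rb * size; 64-bit wraparound is an assumption.
    address = (rc + rb * size) & (2**64 - 1)
    if aligned and address % size != 0:
        raise AccessDisallowed(hex(address))
    raw = bytes(memory[address:address + size])
    order = "big" if big_endian else "little"
    value = int.from_bytes(raw, order, signed=signed)
    # Masking a negative value yields its two's-complement sign extension.
    return value & (2**128 - 1)
```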

Store and Store Immediate

As shown in FIGS. 52A and 53A, in one embodiment of the invention, memory access operations may also include various Store and Store Immediate instructions. These figures and FIGS. 52B and 53B show that the various Store and Store Immediate instructions specify a size of memory operand, byte, doublet, quadlet, octlet, or hexlet, representing 8, 16, 32, 64, and 128 bits respectively. The instructions further specify aligned memory operands, or not, represented by including an A, or with the A omitted, respectively. The instructions further specify a byte-ordering of the memory operand, either big-endian or little-endian, represented by B and L respectively.

Each instruction specifies the above items with the following exceptions: S.8 and S.I.8 need not distinguish between little-endian and big-endian ordering, nor between aligned and unaligned, as only a single byte is stored.

Regarding footnote 1 in FIG. 52A, S.8 need not specify byte ordering, nor need it specify alignment checking, as it stores a single byte.

Regarding footnote 1 in FIG. 53A, SI.8 need not specify byte ordering, nor need it specify alignment checking, as it stores a single byte.

FIGS. 52B and 52C illustrate an exemplary embodiment of formats and operation codes that can be used to perform Store instructions. These operations add the contents of two registers to produce a virtual address, and store the contents of a register into memory. As shown in FIGS. 52B and 52C, in this exemplary embodiment, an operand size, expressed in bytes, is specified by the instruction. A virtual address is computed from the sum of the contents of register rc and the contents of register rb multiplied by operand size.

FIGS. 53B and 53C illustrate an exemplary embodiment of formats and operation codes that can be used to perform Store Immediate instructions. These operations add the contents of a register to a sign-extended immediate value to produce a virtual address, and store the contents of a register into memory. As shown in FIGS. 53B and 53C, in this exemplary embodiment, an operand size, expressed in bytes, is specified by the instruction. A virtual address is computed from the sum of the contents of register rc and the sign-extended value of the offset field, multiplied by the operand size.

In an exemplary embodiment, for both Store and Store Immediate instructions, the contents of register rd, treated as the size specified, are stored in memory using the specified byte order. If alignment is specified, the computed virtual address must be aligned, that is, it must be an exact multiple of the size expressed in bytes. If the address is not aligned, an “access disallowed by virtual address” exception occurs.

Store Multiplex and Store Multiplex Immediate

As shown in FIGS. 52A and 53A, in one embodiment of the invention, memory access operations may also include various Store Multiplex and Store Multiplex Immediate instructions. These figures and FIGS. 52B and 53B show that the various Store Multiplex and Store Multiplex Immediate instructions specify a size of memory operand, octlet, representing 64 bits. The instructions further specify aligned memory operands, represented by including an A. The instructions further specify a byte-ordering of the memory operand, either big-endian or little-endian, represented by B and L respectively.

FIGS. 52B and 52C illustrate an exemplary embodiment of formats and operation codes that can be used to perform Store Multiplex instructions. As shown in FIGS. 52B and 52C, in this exemplary embodiment, an operand size, expressed in bytes, is specified by the instruction. A virtual address is computed from the sum of the contents of register rc and the contents of register rb multiplied by operand size.

FIGS. 53B and 53C illustrate an exemplary embodiment of formats and operation codes that can be used to perform Store Multiplex Immediate instructions. As shown in FIGS. 53B and 53C, in this exemplary embodiment, an operand size, expressed in bytes, is specified by the instruction. A virtual address is computed from the sum of the contents of register rc and the sign-extended value of the offset field, multiplied by the operand size.

In an exemplary embodiment, for both Store Multiplex and Store Multiplex Immediate instructions, data contents and mask contents of the contents of register rd are identified. The data contents are stored in memory using the specified byte order for values in which the corresponding mask contents are set. In an exemplary embodiment, it can be understood that masked writing of data can be accomplished by indivisibly reading the original contents of the addressed memory operand, modifying the value, and writing the modified value back to the addressed memory operand. In an exemplary embodiment, the modification of the value is accomplished using an operation previously identified as a Multiplex operation in the section titled Group Multiplex, above, and in FIG. 31E.

In an exemplary embodiment, for both Store Multiplex and Store Multiplex Immediate instructions, the computed virtual address must be aligned, that is, it must be an exact multiple of the size expressed in bytes. If the address is not aligned, an “access disallowed by virtual address” exception occurs.
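The masked write described above amounts to a read, a bitwise multiplex, and a write back. The Python sketch below shows that arithmetic on a bytearray. Which half of register rd holds the data and which holds the mask is not stated in this passage, so the split used here (data in the low-order octlet, mask in the high-order octlet) is an assumption, as are the function and exception names. The indivisibility of the read-modify-write is not modeled.

```python
MASK64 = (1 << 64) - 1


class AccessDisallowed(Exception):
    """Stand-in for the "access disallowed by virtual address" exception."""


def store_multiplex(memory, address, rd, big_endian=False):
    """Write only the masked bits of the data octlet into memory."""
    if address % 8 != 0:
        raise AccessDisallowed(hex(address))
    order = "big" if big_endian else "little"
    data = rd & MASK64          # assumption: low-order octlet is the data
    mask = (rd >> 64) & MASK64  # assumption: high-order octlet is the mask
    old = int.from_bytes(memory[address:address + 8], order)
    # Multiplex: take data bits where mask is 1, old memory bits elsewhere.
    new = (data & mask) | (old & ~mask & MASK64)
    memory[address:address + 8] = new.to_bytes(8, order)
```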

Additional Load and Execute Resources

In an exemplary embodiment, studies of the dynamic distribution of instructions on various benchmark suites indicate that the most frequently-issued instruction classes are load instructions and execute instructions. In an exemplary embodiment, it is advantageous to consider execution pipelines in which the ability to target the machine resources toward issuing load and execute instructions is increased.

In an exemplary embodiment, one of the means to increase the ability to issue execute-class instructions is to provide the means to issue two execute instructions in a single-issue string. The execution unit actually requires several distinct resources, so by partitioning these resources, the issue capability can be increased without increasing the number of functional units, other than the increased register file read and write ports. In an exemplary embodiment, the partitioning favored places all instructions that involve shifting and shuffling in one execution unit, and all instructions that involve multiplication, including fixed-point and floating-point multiply and add, in another unit. In an exemplary embodiment, resources used for implementing add, subtract, and bitwise logical operations may be duplicated, being modest in size compared to the shift and multiply units. In another exemplary embodiment, resources used are shared between the two units, as the operations have low-enough latency that two operations might be pipelined within a single issue cycle. These instructions must generally be independent, except that, in another exemplary embodiment, two simple add, subtract, or bitwise logical instructions may be performed dependently if the resources for executing simple instructions are shared between the execution units.

In an exemplary embodiment, one of the means to increase the ability to issue load-class instructions is to provide the means to issue two load instructions in a single-issue string. This would generally increase the resources required of the data fetch unit and the data cache, but a compensating solution is to steal the resources for the store instruction to execute the second load instruction. Thus, in an exemplary embodiment, a single-issue string can then contain either two load instructions, or one load instruction and one store instruction, which uses the same register read ports and address computation resources as the basic 5-instruction string in another exemplary embodiment.

In an exemplary embodiment, this capability also may be employed to provide support for unaligned load and store instructions, where a single-issue string may contain as an alternative a single unaligned load or store instruction which uses the resources of the two load-class units in concert to accomplish the unaligned memory operation.

Always Reserved

This operation generates a reserved instruction exception.

Description

The reserved instruction exception is raised. Software may depend upon this major operation code raising the reserved instruction exception in all implementations. The choice of operation code intentionally ensures that a branch to a zeroed memory area will raise an exception.

An exemplary embodiment of the Always Reserved instruction is shown in FIGS. 58A-58C.

Address

These operations perform calculations with two general register values, placing the result in a general register.

Description

The contents of registers rc and rb are fetched and the specified operation is performed on these operands. The result is placed into register rd.

An exemplary embodiment of the Address instructions is shown in FIGS. 59A-59C.

Address Compare

These operations perform calculations with two general register values and generate a fixed-point arithmetic exception if the condition specified is met.

Description

The contents of registers rd and rc are fetched and the specified condition is calculated on these operands. If the specified condition is true, a fixed-point arithmetic exception is generated. This instruction generates no general register results.

An exemplary embodiment of the Address Compare instructions is shown in FIGS. 60A-60C.

Address Copy Immediate

This operation produces one immediate value, placing the result in a general register.

Description

An immediate value is sign-extended from the 18-bit imm field. The result is placed into register rd.

An exemplary embodiment of the Address Copy Immediate instruction is shown in FIGS. 61A-61C.

Address Immediate

These operations perform calculations with one general register value and one immediate value, placing the result in a general register.

Description

The contents of register rc are fetched, and a 64-bit immediate value is sign-extended from the 12-bit imm field. The specified operation is performed on these operands. The result is placed into register rd.

An exemplary embodiment of the Address Immediate instructions is shown in FIGS. 62A-62C.
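The sign extension used for the 12-bit imm field here, and for the 18-bit imm field of Address Copy Immediate, can be written compactly. The following Python helper is a generic sketch with a hypothetical name; the 64-bit result width follows the text above.

```python
def sign_extend(field, field_bits, result_bits=64):
    """Sign-extend a field_bits-wide value to result_bits (two's complement)."""
    field &= (1 << field_bits) - 1
    sign = 1 << (field_bits - 1)
    # XOR-then-subtract maps the field onto a signed Python integer.
    signed_value = (field ^ sign) - sign
    return signed_value & ((1 << result_bits) - 1)
```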

Address Immediate Reversed

These operations perform calculations with one general register value and one immediate value, placing the result in a general register.

Description

The contents of register rc are fetched, and a 64-bit immediate value is sign-extended from the 12-bit imm field. The specified operation is performed on these operands. The result is placed into register rd.

An exemplary embodiment of the Address Immediate Reversed instructions is shown in FIGS. 63A-63C.

Address Reversed

These operations perform calculations with two general register values, placing the result in a general register.

Description

The contents of registers rc and rb are fetched and the specified operation is performed on these operands. The result is placed into register rd.

An exemplary embodiment of the Address Reversed instructions is shown in FIGS. 64A-64C.

Address Shift Left Immediate Add

These operations perform calculations with two general register values, placing the result in a general register.

Description

The contents of register rb are shifted left by the immediate amount and added to the contents of register rc. The result is placed into register rd.

An exemplary embodiment of the Address Shift Left Immediate Add instruction is shown in FIGS. 65A-65C.

Address Shift Left Immediate Subtract

These operations perform calculations with two general register values, placing the result in a general register.

Description

The contents of register rc are subtracted from the contents of register rb shifted left by the immediate amount. The result is placed into register rd.

An exemplary embodiment of the Address Shift Left Immediate Subtract instruction is shown in FIGS. 66A-66C.
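The two shift-and-combine address operations above reduce to one line of arithmetic each. The Python sketch below assumes 64-bit address registers with wraparound; the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def shift_left_immediate_add(rc, rb, shift):
    """rd = rc + (rb << shift), truncated to 64 bits (assumed width)."""
    return (rc + (rb << shift)) & MASK64


def shift_left_immediate_subtract(rc, rb, shift):
    """rd = (rb << shift) - rc, truncated to 64 bits (assumed width)."""
    return ((rb << shift) - rc) & MASK64
```

A typical use of the add form would be indexing an array of 8-byte elements: with a base address in rc, an index in rb, and a shift of 3, the result is the address of the indexed element.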

Address Shift Immediate

These operations perform calculations with one general register value and one immediate value, placing the result in a general register.

Description

The contents of register rc are fetched, and a 6-bit immediate value is taken from the 6-bit simm field. The specified operation is performed on these operands. The result is placed into register rd.

An exemplary embodiment of the Address Shift Immediate instructions is shown in FIGS. 67A-67C.

Address Ternary

These operations perform calculations with three general register values, placing the result in a fourth general register.

Description

The contents of registers rd, rc, and rb are fetched. The specified operation is performed on these operands. The result is placed into register ra.

An exemplary embodiment of the Address Ternary instruction is shown in FIGS. 68A-68C.

Branch

This operation branches to a location specified by a register.

Description

Execution branches to the address specified by the contents of register rd.

An access disallowed exception occurs if the contents of register rd are not aligned on a quadlet boundary.

An exemplary embodiment of the Branch instruction is shown in FIGS. 69A-69C.

Branch Back

This operation branches to a location specified by the previous contents of register 0, reduces the current privilege level, loads a value from memory, and restores register 0 to the value saved on a previous exception.

Description

Processor context, including program counter and privilege level, is restored from register 0, where it was saved at the last exception. Exception state, if set, is cleared, re-enabling normal exception handling. The contents of register 0 saved at the last exception are restored from memory. The privilege level is only lowered, so that this instruction need not be privileged.

If the previous exception was an AccessDetail exception, Continuation State set at the time of the exception affects the operation of the next instruction after this Branch Back, causing the previous AccessDetail exception to be inhibited. If software is performing this instruction to abort a sequence ending in an AccessDetail exception, it should abort by branching to an instruction that is not affected by Continuation State.

An exemplary embodiment of the Branch Back instruction is shown in FIGS. 70A-70C.

Branch Barrier

This operation stops the current thread until all pending stores are completed, then branches to a location specified by a register.

Description

The instruction fetch unit is directed to cease execution until all pending stores are completed. Following the barrier, any previously pre-fetched instructions are discarded and execution branches to the address specified by the contents of register rd.

An access disallowed exception occurs if the contents of register rd are not aligned on a quadlet boundary.

Self-modifying, dynamically-generated, or loaded code may require use of this instruction between storing the code into memory and executing the code.

An exemplary embodiment of the Branch Barrier instruction is shown in FIGS. 71A-71C.

Branch Conditional

These operations compare two operands and, depending on the result of that comparison, conditionally branch to a nearby code location.

Description

The contents of registers rd and rc are compared, as specified by the op field. If the result of the comparison is true, execution branches to the address specified by the offset field. Otherwise, execution continues at the next sequential instruction.

Regarding footnote 1 in FIG. 72A, B.G.Z is encoded as B.L.0 with both instruction fields rd and rc equal.

Regarding footnote 2 in FIG. 72A, B.GE.Z is encoded as B.GE with both instruction fields rd and rc equal.

Regarding footnote 3 in FIG. 72A, B.L.Z is encoded as B.L with both instruction fields rd and rc equal.

Regarding footnote 4 in FIG. 72A, B.LE.Z is encoded as B.GE.0 with both instruction fields rd and rc equal.

An exemplary embodiment of the Branch Conditional instructions is shown in FIGS. 72A-72C.

Branch Conditional Floating-Point

These operations compare two floating-point operands and, depending on the result of that comparison, conditionally branch to a nearby code location.

Description

The contents of registers rc and rd are compared, as specified by the op field. If the result of the comparison is true, execution branches to the address specified by the offset field. Otherwise, execution continues at the next sequential instruction.

An exemplary embodiment of the Branch Conditional Floating-Point instructions is shown in FIGS. 73A-73C.

Branch Conditional Visibility Floating-Point

These operations compare two group-floating-point operands and, depending on the result of that comparison, conditionally branch to a nearby code location.

Description

The contents of registers rc and rd are compared, as specified by the op field. If the result of the comparison is true, execution branches to the address specified by the offset field. Otherwise, execution continues at the next sequential instruction.

Each operand is assumed to represent a vertex of the form [w z y x] packed into a single register. The comparisons check for visibility of a line connecting the vertices against a standard viewing volume, defined by the planes x=w, x=−w, y=w, y=−w, z=0, and z=1. A line is visible (V) if the vertices are both within the volume. A line is not visible (NV) if either vertex is outside the volume; in such a case, the line may be partially visible. A line is invisible (I) if the vertices are both outside any face of the volume. A line is not invisible (NI) if the vertices are not both outside any face of the volume.
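The four visibility conditions above correspond to the familiar outcode test used in line clipping. The Python sketch below illustrates that logic on unpacked vertices. It assumes that points lying exactly on a plane count as inside the volume and that "both outside any face" means both vertices lie outside the same face. The function names are hypothetical and no register packing is modeled.

```python
def outcode(vertex):
    """Return one bit per viewing-volume face that the vertex lies outside."""
    w, z, y, x = vertex  # same order as the [w z y x] form in the text
    code = 0
    if x > w:
        code |= 1
    if x < -w:
        code |= 2
    if y > w:
        code |= 4
    if y < -w:
        code |= 8
    if z < 0:
        code |= 16
    if z > 1:
        code |= 32
    return code


def classify(vertex_a, vertex_b):
    """Evaluate the V, NV, I, and NI conditions for the connecting line."""
    code_a, code_b = outcode(vertex_a), outcode(vertex_b)
    visible = code_a == 0 and code_b == 0
    # Assumption: invisible means both vertices are outside a common face.
    invisible = (code_a & code_b) != 0
    return {"V": visible, "NV": not visible,
            "I": invisible, "NI": not invisible}
```

A line whose vertices lie outside opposite faces is reported as both NV and NI, which matches the partially visible case mentioned in the text.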

An exemplary embodiment of the Branch Conditional Visibility Floating-Point instructions is shown in FIGS. 74A-74C.

Branch Down

This operation branches to a location specified by a register, reducing the current privilege level.

Description

Execution branches to the address specified by the contents of register rd. The current privilege level is reduced to the level specified by the low order two bits of the contents of register rd.

An exemplary embodiment of the Branch Down instruction is shown in FIGS. 75A-75C.

Branch Gateway

This operation provides a secure means to call a procedure, including those at a higher privilege level.

Description

The contents of register rb are a branch address in the high-order 62 bits and a new privilege level in the low-order 2 bits. A branch and link occurs to the branch address, and the privilege level is raised to the new privilege level. The high-order 62 bits of the successor to the current program counter are catenated with the 2-bit current execution privilege and placed in register 0.

If the new privilege level is greater than the current privilege level, an octlet of memory data is fetched from the address specified by register 1, using the little-endian byte order and a gateway access type. A GatewayDisallowed exception occurs if the original contents of register 0 do not equal the memory data.

If the new privilege level is the same as the current privilege level, no checking of register 1 is performed.

An AccessDisallowed exception occurs if the new privilege level is greater than the privilege level required to write the memory data, or if the old privilege level is lower than the privilege required to access the memory data as a gateway, or if the access is not aligned on an 8-byte boundary.

A ReservedInstruction exception occurs if the rc field is not one or the rd field is not zero.

In the example below, a gateway from level 0 to level 2 is illustrated. The gateway pointer, located by the contents of register rc (1), is fetched from memory and compared against the contents of register rb (0). The instruction may only complete if these values are equal. Concurrently, the contents of register rb (0) are placed in the program counter and privilege level, and the address of the next sequential instruction and privilege level are placed into register rd (0). Code at the target of the gateway locates the data pointer at an offset from the gateway pointer (register 1), and fetches it into register 1, making a data region available. Referring to FIG. 54H, the stack pointer may be saved and fetched using the data region, another region located from the data region, or a data region located as an offset from the original gateway pointer.

For additional information on the branch-gateway instruction, see the System and Privileged Library Calls section.

This instruction gives the target procedure the assurances that register 0 contains a valid return address and privilege level, that register 1 points to the gateway location, and that the gateway location is octlet aligned. Register 1 can then be used to securely reach values in memory. If no sharing of literal pools is desired, register 1 may be used as a literal pool pointer directly. If sharing of literal pools is desired, register 1 may be used with an appropriate offset to load a new literal pool pointer; for example, with a one cache line offset from register 1. Note that because the virtual memory system operates with cache line granularity, several gateway locations must be created together.

Software must ensure that an attempt to use any octlet within the region designated by virtual memory as gateway either functions properly or causes a legitimate exception. For example, if the adjacent octlets contain pointers to literal pool locations, software should ensure that these literal pools are not executable, or that by virtue of being aligned addresses, cannot raise the execution privilege level. If register 1 is used directly as a literal pool location, software must ensure that the literal pool locations that are accessible as a gateway do not lead to a security violation.

Because register 0 contains a valid return address and privilege level, the value is suitable for use directly in the Branch-down (B.DOWN) instruction to return to the gateway callee.
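The core of the gateway check described above can be modeled in a few lines of Python. This is a simplified sketch under several assumptions: instructions are 4 bytes long, so the successor of the program counter is pc + 4; registers are a plain mapping from register number to value; the permission checks tied to the virtual memory system are omitted; and a request for a lower privilege level leaves the current level unchanged. All names are hypothetical.

```python
class GatewayDisallowed(Exception):
    """Stand-in for the GatewayDisallowed exception."""


class AccessDisallowed(Exception):
    """Stand-in for the AccessDisallowed exception."""


def branch_gateway(regs, memory, pc, privilege):
    """Return (new_pc, new_privilege) and update register 0 with the link."""
    target = regs[0]        # register named by the rb field (0)
    gateway_ptr = regs[1]   # register named by the rc field (1)
    new_privilege = target & 3
    branch_address = target & ~3
    if new_privilege > privilege:
        if gateway_ptr % 8 != 0:
            raise AccessDisallowed(hex(gateway_ptr))
        # The gateway octlet is fetched in little-endian byte order.
        gateway = int.from_bytes(memory[gateway_ptr:gateway_ptr + 8], "little")
        if gateway != target:
            raise GatewayDisallowed(hex(target))
    # Link: successor address (assumed pc + 4) catenated with old privilege.
    regs[0] = ((pc + 4) & ~3) | privilege
    return branch_address, max(privilege, new_privilege)
```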

An exemplary embodiment of the Branch Gateway instruction is shown in FIGS. 76A-76C.

Branch Halt

This operation stops the current thread until an exception occurs.

Description

This instruction directs the instruction fetch unit to cease execution until an exception occurs.

An exemplary embodiment of the Branch Halt instruction is shown in FIGS. 77A-77C.

Branch Hint

This operation indicates a future branch location specified by a register.

Description

This instruction directs the instruction fetch unit of the processor that a branch is likely to occur count times at simm instructions following the current successor instruction to the address specified by the contents of register rd.

After branching count times, the instruction fetch unit should presume that the branch at simm instructions following the current successor instruction is not likely to occur. If count is zero, this hint directs the instruction fetch unit that the branch is likely to occur more than 63 times.

An access disallowed exception occurs if the contents of register rd are not aligned on a quadlet boundary.

An exemplary embodiment of the Branch Hint instruction is shown in FIGS. 78A-78C.

Branch Hint Immediate

This operation indicates a future branch location specified as an offset from the program counter.

Description

This instruction directs the instruction fetch unit of the processor that a branch is likely to occur count times at simm instructions following the current successor instruction to the address specified by the offset field.

After branching count times, the instruction fetch unit should presume that the branch at simm instructions following the current successor instruction is not likely to occur. If count is zero, this hint directs the instruction fetch unit that the branch is likely to occur more than 63 times.

An exemplary embodiment of the Branch Hint Immediate instruction is shown in FIGS. 79A-79C.

Branch Immediate

This operation branches to a location that is specified as an offset from the program counter.

Description

Execution branches to the address specified by the offset field.

An exemplary embodiment of the Branch Immediate instruction is shown in FIGS. 80A-80C.

Branch Immediate Link

This operation branches to a location that is specified as an offset from the program counter, saving the value of the program counter into register 0.

Description

The address of the instruction following this one is placed into register 0. Execution branches to the address specified by the offset field.

An exemplary embodiment of the Branch Immediate Link instruction is shown in FIGS. 81A-81C.

Branch Link

This operation branches to a location specified by a register, saving the value of the program counter into a register.

Description

The address of the instruction following this one is placed into register rd. Execution branches to the address specified by the contents of register rc.

An access disallowed exception occurs if the contents of register rc are not aligned on a quadlet boundary.

A reserved instruction exception occurs if rb is not zero.

An exemplary embodiment of the Branch Link instruction is shown in FIGS. 82A-82C.

Store Double Compare Swap

These operations compare two 64-bit values in a register against two 64-bit values read from two 64-bit memory locations, as specified by two 64-bit addresses in a register, and if equal, store two new 64-bit values from a register into the memory locations. The values read from memory are catenated and placed in a register.

Description

Two virtual addresses are extracted from the low order bits of the contents of registers rc and rb. Two 64-bit comparison values are extracted from the high order bits of the contents of registers rc and rb. Two 64-bit replacement values are extracted from the contents of register rd. The contents of memory using the specified byte order are read from the specified addresses, treated as 64-bit values, compared against the specified comparison values, and if both read values are equal to the comparison values, the two replacement values are written to memory using the specified byte order. If either is unequal, no values are written to memory. The loaded values are catenated and placed in the register specified by rd.

Each virtual address must be aligned, that is, it must be an exact multiple of the size expressed in bytes. If an address is not aligned, an “access disallowed by virtual address” exception occurs.
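The compare-and-swap behavior described above can be sketched in software. The following Python model is illustrative only: memory is a hypothetical dictionary of 64-bit values keyed by address, byte order and address translation are not modeled, and the assignment of the high half of rd to the rc address (and the low half to the rb address) is an assumption, since the text does not state which replacement value goes to which location.

```python
MASK64 = (1 << 64) - 1

def store_double_compare_swap(memory, rd, rc, rb):
    """Return the new value of rd; update memory only if both compares match.

    memory: dict mapping an 8-byte-aligned address to a 64-bit value
            (hypothetical stand-in for the memory system).
    rd, rc, rb: 128-bit register contents as Python integers.
    """
    addr_c, addr_b = rc & MASK64, rb & MASK64   # low-order bits: addresses
    cmp_c, cmp_b = rc >> 64, rb >> 64           # high-order bits: comparison values
    new_c, new_b = rd >> 64, rd & MASK64        # replacement values (order assumed)
    for addr in (addr_c, addr_b):
        if addr % 8 != 0:                       # 64-bit operands: multiple of 8 bytes
            raise ValueError("access disallowed by virtual address")
    old_c, old_b = memory[addr_c], memory[addr_b]
    if old_c == cmp_c and old_b == cmp_b:       # both must match, else nothing is written
        memory[addr_c], memory[addr_b] = new_c, new_b
    return (old_c << 64) | old_b                # loaded values, catenated
```

In this model the caller learns whether the swap took effect by comparing the returned value with the comparison values it supplied.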

An exemplary embodiment of the Store Double Compare Swap instructions is shown in FIGS. 83A-83C.

Store Immediate Inplace

These operations add the contents of a register to a sign-extended immediate value to produce a virtual address, and store the contents of a register into memory.

Description

A virtual address is computed from the sum of the contents of register rc and the sign-extended value of the offset field. The contents of memory using the specified byte order are read and treated as a 64-bit value. A specified operation is performed between the memory contents and the original contents of register rd, and the result is written to memory using the specified byte order. The original memory contents are placed into register rd.

The computed virtual address must be aligned, that is, it must be an exact multiple of the size expressed in bytes. If the address is not aligned an “access disallowed by virtual address” exception occurs.

An exemplary embodiment of the Store Immediate Inplace instructions is shown in FIGS. 84A-84C.

Store Inplace

These operations add the contents of two registers to produce a virtual address, and store the contents of a register into memory.

Description

A virtual address is computed from the sum of the contents of register rc and the contents of register rb multiplied by operand size. The contents of memory using the specified byte order are read and treated as 64 bits. A specified operation is performed between the memory contents and the original contents of register rd, and the result is written to memory using the specified byte order. The original memory contents are placed into register rd.

The computed virtual address must be aligned, that is, it must be an exact multiple of the size expressed in bytes. If the address is not aligned an “access disallowed by virtual address” exception occurs.

An exemplary embodiment of the Store Inplace instructions is shown in FIGS. 85A-85C.

Group Add Halve

These operations take operands from two registers, perform operations on partitions of bits in the operands, and place the concatenated results in a third register.

Description

The contents of registers rc and rb are partitioned into groups of operands of the size specified, added, halved, and rounded as specified, yielding a group of results, each of which is the size specified. The results never overflow, so limiting is not required by this operation. The group of results is catenated and placed in register rd.

Z (zero) rounding is not defined for unsigned operations, and a ReservedInstruction exception is raised if attempted. F (floor) rounding will properly round unsigned results downward.
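As an illustration of the partitioned add-halve behavior, the following Python sketch models registers as integers and handles only the F (floor) and Z (zero) rounding modes named above. The function names and the width parameter are hypothetical, and other rounding modes are deliberately left out because their tie-breaking rules are not stated in this passage.

```python
def to_signed(value, size):
    """Interpret a size-bit field as a two's-complement integer."""
    return value - (1 << size) if value >> (size - 1) else value

def group_add_halve(rc, rb, size, signed, rounding, width=128):
    """Add corresponding size-bit partitions of rc and rb, then halve each sum."""
    if rounding not in ("F", "Z"):
        raise ValueError("only F and Z rounding are modeled in this sketch")
    if rounding == "Z" and not signed:
        raise ValueError("ReservedInstruction: Z rounding is undefined for unsigned operands")
    mask = (1 << size) - 1
    result = 0
    for lane in range(width // size):
        a = (rc >> (lane * size)) & mask
        b = (rb >> (lane * size)) & mask
        if signed:
            a, b = to_signed(a, size), to_signed(b, size)
        total = a + b                          # exact sum, one bit wider than the operands
        if rounding == "F":
            half = total >> 1                  # arithmetic shift rounds toward minus infinity
        else:
            half = -((-total) >> 1) if total < 0 else total >> 1   # round toward zero
        result |= (half & mask) << (lane * size)
    return result
```

Because the exact sum is halved before it is written back, each result always fits in the original partition size, which matches the statement that no limiting is needed.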

An exemplary embodiment of the Group Add Halve instructions is shown in FIGS. 86A-86C.

Group Copy Immediate

This operation copies an immediate value to a general register.

Description

A 128-bit immediate value is produced from the operation code, the size field and the 16-bit imm field. The result is placed into register ra.

An exemplary embodiment of the Group Copy Immediate instructions is shown in FIGS. 87A-87C.

Group Immediate

These operations take operands from a register and an immediate value, perform operations on partitions of bits in the operands, and place the concatenated results in a second register.

Description

The contents of register rc is fetched, and a 128-bit immediate value is produced from the operation code, the size field and the 10-bit imm field. The specified operation is performed on these operands. The result is placed into register ra.

An exemplary embodiment of the Group Immediate instructions is shown in FIGS. 88A-88C.

Group Immediate Reversed

These operations take operands from a register and an immediate value, perform operations on partitions of bits in the operands, and place the concatenated results in a second register.

Description

The contents of register rc is fetched, and a 128-bit immediate value is produced from the operation code, the size field and the 10-bit imm field. The specified operation is performed on these operands. The result is placed into register rd.

An exemplary embodiment of the Group Immediate Reversed instructions is shown in FIGS. 89A-89C.

Group Inplace

These operations take operands from three registers, perform operations on partitions of bits in the operands, and place the concatenated results in the third register.

Description

The contents of registers rd, rc and rb are fetched. The specified operation is performed on these operands. The result is placed into register rd.

Register rd is both a source and destination of this instruction.

An exemplary embodiment of the Group Inplace instruction is shown in FIGS. 90A-90C.

Group Shift Left Immediate Add

These operations take operands from two registers, perform operations on partitions of bits in the operands, and place the concatenated results in a third register.

Description

The contents of registers rc and rb are partitioned into groups of operands of the size specified. Partitions of the contents of register rb are shifted left by the amount specified in the immediate field and added to partitions of the contents of register rc, yielding a group of results, each of which is the size specified. Overflows are ignored, and yield modular arithmetic results. The group of results is catenated and placed in register rd.
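A minimal Python sketch of this shift-and-add, assuming registers are modeled as integers and using hypothetical parameter names, shows how masking each partition produces the modular results described:

```python
def group_shift_left_immediate_add(rc, rb, shift, size, width=128):
    """Per partition: (rc + (rb << shift)) modulo 2**size."""
    mask = (1 << size) - 1
    result = 0
    for lane in range(width // size):
        c = (rc >> (lane * size)) & mask
        b = (rb >> (lane * size)) & mask
        total = (c + (b << shift)) & mask      # overflow is discarded, not carried into the next partition
        result |= total << (lane * size)
    return result
```

For example, with 16-bit partitions a low partition holding 0xFFFF plus (1 << 2) wraps to 0x0003 without disturbing the neighboring partition.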

An exemplary embodiment of the Group Shift Left Immediate Add instructions is shown in FIGS. 91A-91C.

Group Shift Left Immediate Subtract

These operations take operands from two registers, perform operations on partitions of bits in the operands, and place the concatenated results in a third register.

Description

The contents of registers rc and rb are partitioned into groups of operands of the size specified. Partitions of the contents of register rc are subtracted from partitions of the contents of register rb shifted left by the amount specified in the immediate field, yielding a group of results, each of which is the size specified. Overflows are ignored, and yield modular arithmetic results. The group of results is catenated and placed in register rd.

An exemplary embodiment of the Group Shift Left Immediate Subtract instructions is shown in FIGS. 92A-92C.

Group Subtract Halve

These operations take operands from two registers, perform operations on partitions of bits in the operands, and place the concatenated results in a third register.

Description

The contents of registers rc and rb are partitioned into groups of operands of the size specified and subtracted, halved, rounded and limited as specified, yielding a group of results, each of which is the size specified. The group of results is catenated and placed in register rd.

The result of this operation is always signed, whether the operands are signed or unsigned.

An exemplary embodiment of the Group Subtract Halve instructions is shown in FIGS. 93A-93C.

Ensemble

These operations take operands from two registers, perform operations on partitions of bits in the operands, and place the concatenated results in a third register.

Description

Two values are taken from the contents of registers rc and rb. The specified operation is performed, and the result is placed in register rd.

An exemplary embodiment of the Ensemble instructions is shown in FIGS. 94A-94C.

Ensemble Convolve Extract Immediate

These instructions take an address from a general register to fetch a large operand from memory, a second operand from a general register, perform a group of operations on partitions of bits in the operands, and catenate the results together, placing the result in a general register.

Description

The contents of registers rd and rc are catenated, as specified by the order parameter, and used as a first value. A second value is the contents of register rb. The values are partitioned into groups of operands of the size specified and are convolved, producing a group of values. The group of values is rounded, and limited as specified, yielding a group of results which is the size specified. The group of results is catenated and placed in register rd.

Z (zero) rounding is not defined for unsigned extract operations, and a ReservedInstruction exception is raised if attempted. F (floor) rounding will properly round unsigned results downward.

The order parameter of the instruction specifies the order in which the contents of registers rd and rc are catenated. The choice is significant because the contents of register rd is overwritten. When little-endian order is specified, the contents are catenated so that the contents of register rc is most significant (left) and the contents of register rd is least significant (right). When big-endian order is specified, the contents are catenated so that the contents of register rd is most significant (left) and the contents of register rc is least significant (right).
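The two catenation orders can be illustrated with a short Python sketch. The order labels "L" and "B" and the function name are hypothetical shorthand for the little-endian and big-endian choices described above; registers are modeled as 128-bit integers.

```python
def catenate_rd_rc(rd, rc, order):
    """Form the 256-bit first value from two 128-bit register contents."""
    if order == "L":            # little-endian: rc is most significant, rd least
        return (rc << 128) | rd
    if order == "B":            # big-endian: rd is most significant, rc least
        return (rd << 128) | rc
    raise ValueError("order must be 'L' or 'B'")
```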

An exemplary embodiment of the Ensemble Convolve Extract Immediate instructions is shown in FIGS. 95A-95E.

Referring to FIG. 95D, an ensemble-convolve-extract-immediate-doublets instruction (ECON.X.I16, ECON.X.IM16, or ECON.X.IU16) convolves vector [x w v u t s r q p o n m l k j i] with vector [h g f e d c b a], yielding the products [ax+bw+cv+du+et+fs+gr+hq . . . as+br+cq+dp+eo+fn+gm+hl ar+bq+cp+do+en+fm+gl+hk aq+bp+co+dn+em+fl+gk+hj], rounded and limited as specified.
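The index pattern of the products listed for FIG. 95D can be checked with a small Python sketch. Here both vectors are plain lists whose index 0 is the rightmost (least significant) element, so the first list holds i through x and the second holds a through h; the rounding and limiting stage is omitted, and the function name is hypothetical.

```python
def convolve_doublets(first, second):
    """Return the unrounded sums, least significant result first.

    first:  2n elements (i at index 0, x at index 15 in the FIG. 95D example)
    second: n elements  (a at index 0, h at index 7)
    Result k is sum over t of second[t] * first[n + k - t].
    """
    n = len(second)
    if len(first) != 2 * n:
        raise ValueError("first operand must have twice as many elements as second")
    return [sum(second[t] * first[n + k - t] for t in range(n)) for k in range(n)]
```

With n = 8, result 0 is aq+bp+co+dn+em+fl+gk+hj and result 7 is ax+bw+cv+du+et+fs+gr+hq, which agrees with the rightmost and leftmost products in the text.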

Referring to FIG. 95E, an ensemble-convolve-extract-immediate-complex-doublets instruction (ECON.X.IC16) convolves vector [x w v u t s r q p o n m l k j i] with vector [h g f e d c b a], yielding the products [ax+bw+cv+du+et+fs+gr+hq . . . as−bt+cq−dr+eo−fp+gm−hn ar+bq+cp+do+en+fm+gl+hk aq−br+co−dp+em−fn+gk+hl], rounded and limited as specified.

Ensemble Convolve Floating-Point

These instructions take an address from a general register to fetch a large operand from memory, a second operand from a general register, perform a group of operations on partitions of bits in the operands, and catenate the results together, placing the result in a general register.

The first value is the catenation of the contents of register rd and rc, as specified by the order parameter. A second value is the contents of register rb. The values are partitioned into groups of operands of the size specified. The second values are multiplied with the first values, then summed, producing a group of result values. The group of result values is catenated and placed in register rd.

An exemplary embodiment of the Ensemble Convolve Floating Point instructions is shown in FIGS. 96A-96E.

Referring to FIG. 96D, an ensemble-convolve-floating-point-half-little-endian instruction (E.CON.F.16.L) convolves vector [x w v u t s r q p o n m l k j i] with vector [h g f e d c b a], yielding the products [ax+bw+cv+du+et+fs+gr+hq . . . as+br+cq+dp+eo+fn+gm+hl ar+bq+cp+do+en+fm+gl+hk aq+bp+co+dn+em+fl+gk+hj].

Referring to FIG. 96E, an ensemble-convolve-complex-floating-point-half-little-endian instruction (E.CON.C.F.16.L) convolves vector [x w v u t s r q p o n m l k j i] with vector [h g f e d c b a], yielding the products [ax+bw+cv+du+et+fs+gr+hq . . . as−bt+cq−dr+eo−fp+gm−hn ar+bq+cp+do+en+fm+gl+hk aq−br+co−dp+em−fn+gk+hl].

Ensemble Extract Immediate

These operations take operands from two registers and a short immediate value, perform operations on partitions of bits in the operands, and place the concatenated results in a third register.

Description

The contents of registers rc and rb are partitioned into groups of operands of the size specified and multiplied, added or subtracted, or are catenated and partitioned into operands of twice the size specified. The group of values is rounded, and limited as specified, yielding a group of results, each of which is the size specified. The group of results is catenated and placed in register rd.

For mixed-signed multiplies, the contents of register rc is signed, and the contents of register rb is unsigned. The extraction operation and the result of mixed-signed multiplies are signed.

Z (zero) rounding is not defined for unsigned extract operations, and a ReservedInstruction exception is raised if attempted. F (floor) rounding will properly round unsigned results downward.

An exemplary embodiment of the Ensemble Extract Immediate instructions is shown in FIGS. 97A-97G.

Referring to FIG. 97D, an ensemble multiply extract immediate doublets instruction (E.MUL.X.I.16 or E.MUL.X.I.U.16) multiplies operand [h g f e d c b a] by operand [p o n m l k j i], yielding the products [hp go fn em dl ck bj ai], rounded and limited as specified.

Referring to FIG. 97E, another illustration of ensemble multiply extract immediate doublets instruction (E.MUL.X.I.16 or E.MUL.X.I.U.16).

Referring to FIG. 97F, an ensemble multiply extract immediate complex doublets instruction (E.MUL.X.I.C.16 or E.MUL.X.I.U.16) multiplies operand [h g f e d c b a] by operand [p o n m l k j i], yielding the result [gp+ho go−hp en+fm em−fn cl+dk ck−dl aj+bi ai−bj], rounded and limited as specified. Note that this instruction prefers an organization of complex numbers in which the real part is located to the right (lower precision) of the imaginary part.
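The complex result pattern for FIG. 97F follows from ordinary complex multiplication once the real part is taken from the right-hand element of each pair. A brief Python sketch, with operands as lists whose index 0 is the rightmost element and with rounding and limiting omitted, is given below; the function name is hypothetical.

```python
def complex_multiply_doublets(x, y):
    """Multiply element pairs as complex numbers: even index real, odd index imaginary."""
    out = []
    for k in range(0, len(x), 2):
        xr, xi = x[k], x[k + 1]          # e.g. a (real) and b (imaginary)
        yr, yi = y[k], y[k + 1]          # e.g. i (real) and j (imaginary)
        out.append(xr * yr - xi * yi)    # real part:      ai - bj
        out.append(xr * yi + xi * yr)    # imaginary part: aj + bi
    return out
```

For the pair (b, a) and (j, i) this yields ai−bj in the lower position and aj+bi in the upper position, as in the rightmost two results listed above.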

Referring to FIG. 97G, another illustration of ensemble multiply extract immediate complex doublets instruction (E.MUL.X.I.C.16 or E.MUL.X.I.U.16).

Ensemble Extract Immediate Inplace

These operations take operands from two registers and a short immediate value, perform operations on partitions of bits in the operands, and place the catenated results in a third register.

Description

The contents of registers rc and rb are partitioned into groups of operands of the size specified and multiplied, added or subtracted, or are catenated and partitioned into operands of twice the size specified. The contents of register rd are partitioned into groups of operands of the size specified and sign- or zero-extended and shifted as specified, then added to the group of values computed. The group of values is rounded, and limited as specified, yielding a group of results which is the size specified. The group of results is catenated and placed in register rd.

For mixed-signed multiplies, the contents of register rc is signed, and the contents of register rb is unsigned. The extraction operation, the contents of register rd, and the result of mixed-signed multiplies are signed.

Z (zero) rounding is not defined for unsigned extract operations, and a ReservedInstruction exception is raised if attempted. F (floor) rounding will properly round unsigned results downward.

An exemplary embodiment of the Ensemble Extract Immediate Inplace instruction is shown in FIGS. 98A-98G.

Referring to FIG. 98D, an ensemble multiply add extract immediate doublets instruction (E.MUL.ADD.X.I.16 or E.MUL.ADD.X.I.U.16) multiplies operand [h g f e d c b a] by operand [p o n m l k j i], then adding [x w v u t s r q], yielding the products [hp+x go+w fn+v em+u dl+t ck+s bj+r ai+q], rounded and limited as specified.

Referring to FIG. 98E, another illustration of ensemble multiply add extract immediate doublets instruction (E.MUL.ADD.X.I.16 or E.MUL.ADD.X.I.U.16).

Referring to FIG. 98F, an ensemble multiply add extract immediate complex doublets instruction (E.MUL.ADD.X.I.C.16 or G.MUL.ADD.X.I.U.16) multiplies operand [h g f e d c b a] by operand [p o n m l k j i], then adding [x w v u t s r q], yielding the result [gp+ho+x go−hp+w en+fm+v em−fn+u cl+dk+t ck−dl+s aj+bi+r ai−bj+q], rounded and limited as specified. Note that this instruction prefers an organization of complex numbers in which the real part is located to the right (lower precision) of the imaginary part.

Referring to FIG. 98G, another illustration of ensemble add multiply extract immediate complex doublets instruction (E.MUL.ADD.X.I.C.16).

Ensemble Inplace

These operations take operands from three registers, perform operations on partitions of bits in the operands, and place the concatenated results in the third register.

Description

The contents of registers rd, rc and rb are fetched. The specified operation is performed on these operands. The result is placed into register rd.

Register rd is both a source and destination of this instruction.

An exemplary embodiment of the Ensemble Inplace instructions is shown in FIGS. 99A-99C.

Wide Multiply Matrix

These instructions take an address from a general register to fetch a large operand from memory, a second operand from a general register, perform a group of operations on partitions of bits in the operands, and catenate the results together, placing the result in a general register.

Description

The contents of register rc is used as a virtual address, and a value of specified size is loaded from memory. A second value is the contents of register rb. The values are partitioned into groups of operands of the size specified. The second values are multiplied with the first values, then summed, producing a group of result values. The group of result values is catenated and placed in register rd.

The memory-multiply instructions (W.MUL.MAT, W.MUL.MAT.C, W.MUL.MAT.M, W.MUL.MAT.P, W.MUL.MAT.U) perform a partitioned array multiply of up to 8192 bits, that is 64×128 bits. The width of the array can be limited to 64, 32, or 16 bits, but not smaller than twice the group size, by adding one-half the desired size in bytes to the virtual address operand: 4, 2, or 1. The array can be limited vertically to 128, 64, 32, or 16 bits, but not smaller than twice the group size, by adding one-half the desired memory operand size in bytes to the virtual address operand.

The virtual address must either be aligned to 1024/gsize bytes (or 512/gsize for W.MUL.MAT.C) (with gsize measured in bits), or must be the sum of an aligned address and one-half of the size of the memory operand in bytes and/or one-quarter of the size of the result in bytes. An aligned address must be an exact multiple of the size expressed in bytes. If the address is not valid an “access disallowed by virtual address” exception occurs.

An exemplary embodiment of the Wide Multiply Matrix instructions is shown in FIGS. 100A-100E.

A wide-multiply-octlets instruction (W.MUL.MAT.type.64, type=NONE M U P) is not implemented and causes a reserved instruction exception, as an ensemble-multiply-sum-octlets instruction (E.MUL.SUM.type.64) performs the same operation except that the multiplier is sourced from a 128-bit register rather than memory. Similarly, instead of wide-multiply-complex-quadlets instruction (W.MUL.MAT.C.32), one should use an ensemble-multiply-complex-quadlets instruction (E.MUL.SUM.C.32).

Referring to FIG. 100D, a wide-multiply-doublets instruction (W.MUL.MAT, W.MUL.MAT.M, W.MUL.MAT.P, W.MUL.MAT.U) multiplies memory [m31 m30 . . . m1 m0] with vector [h g f e d c b a], yielding products [hm31+gm27+ . . . +bm7+am3 . . . hm28+gm24+ . . . +bm4+am0].
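The product pattern given for FIG. 100D corresponds to a vector-matrix multiply in which each vector element scales one row of the memory operand. A Python sketch of that indexing follows; the row-major layout is read off the listed products (a pairs with m0..m3, b with m4..m7, and so on), lists use index 0 for the rightmost element, and the function name is hypothetical.

```python
def wide_multiply_matrix(memory, vector):
    """Return one sum per column, least significant result first.

    memory: len(vector) * cols elements, m0 at index 0
    vector: a at index 0, h at index 7 in the FIG. 100D example
    Result j is sum over t of vector[t] * memory[cols * t + j].
    """
    if len(memory) % len(vector) != 0:
        raise ValueError("memory length must be a multiple of vector length")
    cols = len(memory) // len(vector)
    return [sum(vector[t] * memory[cols * t + j] for t in range(len(vector)))
            for j in range(cols)]
```

With 32 memory elements and 8 vector elements there are 4 results; result 0 is hm28+gm24+ . . . +bm4+am0 and result 3 is hm31+gm27+ . . . +bm7+am3.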

Referring to FIG. 100E, a wide-multiply-matrix-complex-doublets instruction (W.MUL.MAT.C) multiplies memory [m15 m14 . . . m1 m0] with vector [h g f e d c b a], yielding products [hm14+gm15+ . . . +bm2+am3 . . . hm12+gm13+ . . . +bm0+am1 −hm13+gm12+ . . . −bm1+am0].

Wide Multiply Matrix Extract

These instructions take an address from a general register to fetch a large operand from memory, a second operand from a general register, perform a group of operations on partitions of bits in the operands, and catenate the results together, placing the result in a general register.

Description

The contents of register rc is used as a virtual address, and a value of specified size is loaded from memory. A second value is the contents of register rd. The group size and other parameters are specified from the contents of register rb. The values are partitioned into groups of operands of the size specified and are multiplied and summed, producing a group of values. The group of values is rounded, and limited as specified, yielding a group of results which is the size specified. The group of results is catenated and placed in register ra.

NOTE: The size of this operation is determined from the contents of register rb. The multiplier usage is constant, but the memory operand size is inversely related to the group size. Presumably this can be checked for cache validity.

We also use low order bits of rc to designate a size, which must be consistent with the group size. Because the memory operand is cached, the size can also be cached, thus eliminating the time required to decode the size, whether from rb or from rc.

The wide-multiply-matrix-extract instructions (W.MUL.MAT.X.B, W.MUL.MAT.X.L) perform a partitioned array multiply of up to 16384 bits, that is 128×128 bits. The width of the array can be limited to 128, 64, 32, or 16 bits, but not smaller than twice the group size, by adding one-half the desired size in bytes to the virtual address operand: 8, 4, 2, or 1. The array can be limited vertically to 128, 64, 32, or 16 bits, but not smaller than twice the group size, by adding one-half the desired memory operand size in bytes to the virtual address operand.

Bits 31..0 of the contents of register rb specify several parameters which control the manner in which data is extracted. The position and default values of the control fields allow for the source position to be added to a fixed control value for dynamic computation, and allow for the lower 16 bits of the control field to be set for some of the simpler extract cases by a single GCOPYI instruction.

The table below describes the meaning of each label:

label  bits  meaning
fsize  8     field size
dpos   8     destination position
x      1     reserved
s      1     signed vs. unsigned
n      1     complex vs. real multiplication
m      1     mixed-sign vs. same-sign multiplication
l      1     saturation vs. truncation
rnd    2     rounding
gssp   9     group size and source position

The 9-bit gssp field encodes both the group size, gsize, and source position, spos, according to the formula gssp=512−4*gsize+spos. The group size, gsize, is a power of two in the range 1..128. The source position, spos, is in the range 0..(2*gsize)−1.
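Because spos is always less than 2*gsize, the quantity 512−gssp lies strictly above 2*gsize and at most at 4*gsize, so the encoding can be inverted unambiguously. The following Python sketch, with hypothetical function names, illustrates the formula and one way to decode it:

```python
GROUP_SIZES = (1, 2, 4, 8, 16, 32, 64, 128)

def encode_gssp(gsize, spos):
    """Apply gssp = 512 - 4*gsize + spos after checking the stated ranges."""
    if gsize not in GROUP_SIZES:
        raise ValueError("gsize must be a power of two in 1..128")
    if not 0 <= spos < 2 * gsize:
        raise ValueError("spos must be in 0..(2*gsize)-1")
    return 512 - 4 * gsize + spos

def decode_gssp(gssp):
    """Recover (gsize, spos): 512 - gssp falls in exactly one band (2*gsize, 4*gsize]."""
    d = 512 - gssp
    for gsize in GROUP_SIZES:
        if 2 * gsize < d <= 4 * gsize:
            return gsize, 4 * gsize - d
    raise ValueError("gssp value does not correspond to a valid encoding")
```

For instance, a group size of 16 with source position 0 encodes as 448, and a group size of 128 with source position 0 encodes as 0.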

The values in the s, n, m, l, and rnd fields have the following meaning:

values  s         n        m           l         rnd
0       unsigned  real     same-sign   truncate  F
1       signed    complex  mixed-sign  saturate  Z
2                                                N
3                                                C

The virtual address must be aligned, that is, it must be an exact multiple of the operand size expressed in bytes. If the address is not aligned an “access disallowed by virtual address” exception occurs.

Z (zero) rounding is not defined for unsigned extract operations, and a ReservedInstruction exception is raised if attempted. F (floor) rounding will properly round unsigned results downward.

An exemplary embodiment of the Wide Multiply Matrix Extract instructions is shown in FIGS. 101A-101E.

Referring to FIG. 101D, a wide-multiply-matrix-extract-doublets instruction (W.MUL.MAT.X.B or W.MUL.MAT.X.L) multiplies memory [m63 m62 m61 . . . m2 m1 m0] with vector [h g f e d c b a], yielding the products [am7+bm15+cm23+dm31+em39+fm47+gm55+hm63 . . . am2+bm10+cm18+dm26+em34+fm42+gm50+hm58 am1+bm9+cm17+dm25+em33+fm41+gm49+hm57 am0+bm8+cm16+dm24+em32+fm40+gm48+hm56], rounded and limited as specified.

Referring to FIG. 101E, a wide-multiply-matrix-extract-complex-doublets instruction (W.MUL.MAT.X with n set in rb) multiplies memory [m31 m30 m29 . . . m2 m1 m0] with vector [h g f e d c b a], yielding the products [am7+bm6+cm15+dm14+em23+fm22+gm31+hm30 . . . am2−bm3+cm10−dm11+em18−fm19+gm26−hm27 am1+bm0+cm9+dm8+em17+fm16+gm25+hm24 am0−bm1+cm8−dm9+em16−fm17+gm24−hm25], rounded and limited as specified.

Wide Multiply Matrix Extract Immediate

These instructions take an address from a general register to fetch a large operand from memory, a second operand from a general register, perform a group of operations on partitions of bits in the operands, and catenate the results together, placing the result in a general register.

Description

The contents of register rc is used as a virtual address, and a value of specified size is loaded from memory. A second value is the contents of register rb. The values are partitioned into groups of operands of the size specified and are multiplied and summed, or are convolved, producing a group of sums. The group of sums is rounded, and limited as specified, yielding a group of results, each of which is the size specified. The group of results is catenated and placed in register rd.

The wide-multiply-extract-immediate-matrix instructions (W.MUL.MAT.X.I, W.MUL.MAT.X.I.U, W.MUL.MAT.X.I.M, W.MUL.MAT.X.I.C) perform a partitioned array multiply of up to 16384 bits, that is 128×128 bits. The width of the array can be limited to 128, 64, 32, or 16 bits, but not smaller than twice the group size, by adding one-half the desired size in bytes to the virtual address operand: 8, 4, 2, or 1. The array can be limited vertically to 128, 64, 32, or 16 bits, but not smaller than twice the group size, by adding one-half the desired memory operand size in bytes to the virtual address operand.

The virtual address must either be aligned to 2048/gsize bytes (or 1024/gsize for W.MUL.MAT.X.I.C), or must be the sum of an aligned address and one-half of the size of the memory operand in bytes and/or one-half of the size of the result in bytes. An aligned address must be an exact multiple of the size expressed in bytes. If the address is not valid an “access disallowed by virtual address” exception occurs.

Z (zero) rounding is not defined for unsigned extract operations, and a ReservedInstruction exception is raised if attempted. F (floor) rounding will properly round unsigned results downward.

An exemplary embodiment of the Wide Multiply Matrix Extract Immediate instructions is shown in FIGS. 102A-102E.

Referring to FIG. 102D, a wide-multiply-extract-immediate-matrix-doublets instruction (W.MUL.MAT.X.I.16 or W.MUL.MAT.X.I.U.16) multiplies memory [m63 m62 m61 . . . m2 m1 m0] with vector [h g f e d c b a], yielding the products [am7+bm15+cm23+dm31+em39+fm47+gm55+hm63 . . . am2+bm10+cm18+dm26+em34+fm42+gm50+hm58 am1+bm9+cm17+dm25+em33+fm41+gm49+hm57 am0+bm8+cm16+dm24+em32+fm40+gm48+hm56], rounded and limited as specified.

Referring to FIG. 102E, a wide-multiply-matrix-extract-immediate-complex-doublets instruction (W.MUL.MAT.X.I.C.16) multiplies memory [m31 m30 m29 . . . m2 m1 m0] with vector [h g f e d c b a], yielding the products [am7+bm6+cm15+dm14+em23+fm22+gm31+hm30 . . . am2−bm3+cm10−dm11+em18−fm19+gm26−hm27 am1+bm0+cm9+dm8+em17+fm16+gm25+hm24 am0−bm1+cm8−dm9+em16−fm17+gm24−hm25], rounded and limited as specified.

Wide Multiply Matrix Floating-Point

These instructions take an address from a general register to fetch a large operand from memory, a second operand from a general register, perform a group of operations on partitions of bits in the operands, and catenate the results together, placing the result in a general register.

Description

The contents of register rc is used as a virtual address, and a value of specified size is loaded from memory. A second value is the contents of register rb. The values are partitioned into groups of operands of the size specified. The second values are multiplied with the first values, then summed, producing a group of result values. The group of result values is catenated and placed in register rd.

The wide-multiply-matrix-floating-point instructions (W.MUL.MAT.F, W.MUL.MAT.C.F) perform a partitioned array multiply of up to 16384 bits, that is 128×128 bits. The width of the array can be limited to 128, 64, or 32 bits, but not smaller than twice the group size, by adding one-half the desired size in bytes to the virtual address operand: 8, 4, or 2. The array can be limited vertically to 128, 64, 32, or 16 bits, but not smaller than twice the group size, by adding one-half the desired memory operand size in bytes to the virtual address operand.

The virtual address must either be aligned to 2048/gsize bytes (or 1024/gsize for W.MUL.MAT.C.F), or must be the sum of an aligned address and one-half of the size of the memory operand in bytes and/or one-half of the size of the result in bytes. An aligned address must be an exact multiple of the size expressed in bytes. If the address is not valid an “access disallowed by virtual address” exception occurs.

An exemplary embodiment of the Wide Multiply Matrix Floating-Point instructions is shown in FIGS. 103A-103E.

Referring to FIG. 103D, a wide-multiply-matrix-floating-point-half instruction (W.MUL.MAT.F) multiplies memory [m31 m30 . . . m1 m0] with vector [h g f e d c b a], yielding products [hm31+gm27+ . . . +bm7+am3 . . . hm28+gm24+ . . . +bm4+am0].

Referring to FIG. 103E, a wide-multiply-matrix-complex-floating-point-half instruction (W.MUL.MAT.C.F) multiplies memory [m15 m14 . . . m1 m0] with vector [h g f e d c b a], yielding products [hm14+gm15+ . . . +bm2+am3 . . . hm12+gm13+ . . . +bm0+am1 −hm13+gm12+ . . . −bm1+am0].

Wide Multiply Matrix Galois

These instructions take an address from a general register to fetch a large operand from memory, second and third operands from general registers, perform a group of operations on partitions of bits in the operands, and catenate the results together, placing the result in a general register.

Description

The contents of register rc is used as a virtual address, and a value of specified size is loaded from memory. Second and third values are the contents of registers rd and rb. The values are partitioned into groups of operands of the size specified. The second values are multiplied as polynomials with the first value, producing a result which is reduced to the Galois field specified by the third value, producing a group of result values. The group of result values is catenated and placed in register ra.
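The element-level arithmetic, multiplying as polynomials and then reducing by a field polynomial, can be sketched in Python as follows. This is an illustration under stated assumptions: the polynomial is passed as a full integer including its leading term (for example 0x11B for the degree-8 polynomial used by AES), which may differ from how the third operand actually encodes it, and the function names are hypothetical. Sums of polynomial products are exclusive-ORs.

```python
def clmul(a, b):
    """Carry-less (polynomial over GF(2)) multiplication of two integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def poly_mod(value, poly):
    """Reduce value modulo poly, both treated as polynomials over GF(2)."""
    degree = poly.bit_length() - 1
    while value.bit_length() - 1 >= degree:
        value ^= poly << (value.bit_length() - 1 - degree)
    return value

def galois_dot(memory_column, vector, poly):
    """XOR-accumulate polynomial products, then reduce once at the end."""
    acc = 0
    for m, v in zip(memory_column, vector):
        acc ^= clmul(m, v)
    return poly_mod(acc, poly)
```

Reducing once after accumulation gives the same answer as reducing each product separately, because reduction modulo a polynomial is linear over exclusive-OR.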

The wide-multiply-matrix-Galois instruction (W.MUL.MAT.G) performs a partitioned array multiply of up to 16384 bits, that is 128×128 bits. The width of the array can be limited to 128, 64, 32, or 16 bits, but not smaller than twice the group size of 8 bits, by adding one-half the desired size in bytes to the virtual address operand: 8, 4, 2, or 1. The array can be limited vertically to 128, 64, 32, or 16 bits, but not smaller than twice the group size of 8 bits, by adding one-half the desired memory operand size in bytes to the virtual address operand.

The virtual address must either be aligned to 256 bytes, or must be the sum of an aligned address and one-half of the size of the memory operand in bytes and/or one-half of the size of the result in bytes. An aligned address must be an exact multiple of the size expressed in bytes. If the address is not valid an “access disallowed by virtual address” exception occurs.

An exemplary embodiment of the Wide Multiply Matrix Galois instructions is shown in FIGS. 104A-104D.

Referring to FIG. 104D, a wide-multiply-matrix-Galois instruction (W.MUL.MAT.G) multiplies memory [m255 m254 . . . m1 m0] with vector [p o n m l k j i h g f e d c b a], reducing the result modulo polynomial [q], yielding products [(pm255+om247+ . . . +bm31+am15 mod q) (pm254+om246+ . . . +bm30+am14 mod q) . . . (pm248+om240+ . . . +bm16+am0 mod q)].

Wide Switch

These instructions take an address from a general register to fetch a large operand from memory, a second operand from a general register, perform a group of operations on partitions of bits in the operands, and catenate the results together, placing the result in a general register.

Description

The contents of register rc specify a virtual address and optionally an operand size, and a value of specified size is loaded from memory. A second value is the catenated contents of registers rd and rb. Eight corresponding bits from the memory value are used to select a single result bit from the second value, for each corresponding bit position. The group of results is catenated and placed in register ra.

The virtual address must either be aligned to 128 bytes, or must be the sum of an aligned address and one-half of the size of the memory operand in bytes. An aligned address must be an exact multiple of the size expressed in bytes. The size of the memory operand must be 8, 16, 32, 64, or 128 bytes. If the address is not valid, an “access disallowed by virtual address” exception occurs. When a size smaller than 128 bits is specified, the high order bits of the memory operand are replaced with values corresponding to the bit position, so that the same memory operand specifies a bit selection within symbols of the operand size, and the same operation is performed on each symbol.

An exemplary embodiment of the Wide Switch instructions is shown in FIGS. 105A-105C.
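The address encoding described for the Wide Switch memory operand can be sketched in Python as follows. The sketch assumes that, when the address is not aligned to the maximum size, its lowest set bit is one-half of the operand size; the function name decode_wide_address and the use of a Python exception to stand in for the “access disallowed by virtual address” exception are illustrative choices only.

```python
def decode_wide_address(va: int, max_bytes: int = 128, min_bytes: int = 8):
    """Split a virtual address into (aligned base, operand size in bytes).

    Assumption: an address aligned to max_bytes selects the full-size
    operand; otherwise the lowest set bit is one-half of the operand size.
    """
    if va % max_bytes == 0:
        return va, max_bytes
    half = va & -va          # lowest set bit of the address
    size = 2 * half
    if size < min_bytes:
        raise ValueError("access disallowed by virtual address")
    # Removing the size bit leaves a base that is a multiple of `size`.
    return va - half, size
```

For example, an address of 0x1004 decodes to an 8-byte operand at 0x1000, while 0x1002 would imply a 4-byte operand and is rejected.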

Wide Translate

These instructions take an address from a general register to fetch a large operand from memory, a second operand from a general register, perform a group of operations on partitions of bits in the operands, and catenate the results together, placing the result in a general register.

Description

The contents of register rc are used as a virtual address, and a value of specified size is loaded from memory. A second value is the contents of register rb. The values are partitioned into groups of operands of a size specified. The low-order bytes of the second group of values are used as addresses to choose entries from one or more tables constructed from the first value, producing a group of values. The group of results is catenated and placed in register rd.

By default, the total width of tables is 128 bits, and a total table width of 128, 64, 32, 16 or 8 bits, but not less than the group size, may be specified by adding the desired total table width in bytes to the specified address: 16, 8, 4, 2, or 1. When fewer than 128 bits are specified, the tables repeat to fill the 128 bit width.

The default depth of each table is 256 entries, or in bytes is 32 times the group size in bits. An operation may specify 4, 8, 16, 32, 64, 128 or 256 entry tables, by adding one-half of the memory operand size to the address. Table index values are masked to ensure that only the specified portion of the table is used. Tables with just 2 entries cannot be specified; if 2-entry tables are desired, it is recommended to load the entries into registers and use G.MUX to select the table entries.

Failing to initialize the entire table is a potential security hole, as an instruction with a small-depth table could access table entries previously initialized by an instruction with a large-depth table. We could close this hole either by initializing the entire table, even if extra cycles are required, or by masking the index bits so that only the initialized portion of the table is used. Initializing the entire table with no penalty in cycles could require writing to as many as 128 entries at once, which is quite likely to cause circuit complications. Initializing the entire table with writes to only one entry at a time requires writing 256 cycles, even when the table is smaller. Masking the index bits is the preferred solution.
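The index masking preferred above can be illustrated by a minimal Python sketch. The function name translate and the representation of a table as a Python list are assumptions made for illustration; the sketch shows only the masking, not the partitioning of register operands.

```python
def translate(table, indices, depth):
    """Look up each index in a table, using only the first `depth` entries."""
    if depth not in (4, 8, 16, 32, 64, 128, 256):
        raise ValueError("unsupported table depth")
    mask = depth - 1
    # Masking keeps every lookup inside the initialized part of the table,
    # so entries left over from a deeper table are never read.
    return [table[i & mask] for i in indices]
```

With a 4-entry depth, an index of 255 is masked to 3, so only the first four entries are reachable regardless of the index value supplied.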

Masking the index bits suggests that this instruction, for tables larger than 256 entries, may be useful for a general-purpose memory translate function where the processor performs enough independent load operations to fill the 128 bits. Thus, the 16, 32, and 64 bit versions of this function perform the equivalent of 8, 4, or 2 withdraw, 8, 4, or 2 load-indexed and 7, 3, or 1 group-extract instructions. In other words, this instruction can be as powerful as 23, 11, or 5 existing instructions. The 8-bit version is a single-cycle operation replacing 47 existing instructions, so these are not as big a win, but nonetheless, this is at least a 50% improvement on a 2-issue processor, even with one-cycle-per-load timing. To make this possible, the default table size would become 65536, 2³² and 2⁶⁴ for 16, 32 and 64-bit versions of the instruction.
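The instruction counts quoted above follow from a simple tally, sketched here in Python under the assumption that each element of the 128-bit result needs one withdraw and one load-indexed instruction, and that combining the elements needs one fewer group-extract than there are elements. The function name replaced_instructions is hypothetical.

```python
def replaced_instructions(group_bits: int, total_bits: int = 128) -> int:
    """Count the instructions one wide translate would stand in for."""
    elements = total_bits // group_bits
    withdraws = elements
    loads = elements
    extracts = elements - 1
    return withdraws + loads + extracts
```

For group sizes of 16, 32, and 64 bits this yields 23, 11, and 5, and for 8 bits it yields 47, matching the figures in the text.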

For the big-endian version of this instruction, in the definition below, the contents of register rb is complemented. This reflects a desire to organize the table so that the lowest addressed table entries are selected when the index is zero. In the logical implementation, complementing the index can be avoided by loading the table memory differently for big-endian and little-endian versions. A consequence of this shortcut is that a table loaded by a big-endian translate instruction cannot be used by a little-endian translate instruction, and vice-versa.

The virtual address must either be aligned to 4096 bytes, or must be the sum of an aligned address and one-half of the size of the memory operand in bytes and/or the desired total table width in bytes. An aligned address must be an exact multiple of the size expressed in bytes. The size of the memory operand must be a power of two from 4 to 4096 bytes, but must be at least 4 times the group size and 4 times the total table width. If the address is not valid, an “access disallowed by virtual address” exception occurs.

An exemplary embodiment of the Wide Translate instructions is shown in FIGS. 106A-106C.

Bus Interface

According to one embodiment of the invention, an initial implementation of the processor uses a “Super Socket 7 compatible” (SS7) bus interface, which is generally similar to and compatible with other “Socket 7” and “Super Socket 7” processors such as the Intel Pentium, Pentium with MMX Technology; AMD K6, K6-II, K6-III; IDT Winchip C6, 2, 2A, 3, 4; Cyrix 6×86, etc. and other “Socket 7” chipsets listed below.

The SS7 bus interface behavior is quite complex, but well-known due to the leading position of the Intel Pentium design. This document does not yet contain all the detailed information related to this bus, and will concentrate on the differences between the Zeus SS7 bus and other designs. For functional specification and pin interface behavior, the Pentium Processor Family Developer's Manual is a primary reference. For 100 MHz SS7 bus timing data, the AMD K6-2 Processor Data Sheet is a primary reference.

Motherboard Chipsets

The following motherboard chipsets are designed for the 100 MHz “Socket 7” bus:

Manufacturer                 Website            Chipset         clock rate   North bridge   South bridge
VIA technologies, Inc.       www.via.com.tw     Apollo MVP3     100 MHz      vt82c598at     vt82c598b
Silicon Integrated Systems   www.sis.com.tw     SiS 5591/5592   75 MHz       SiS 5591       SiS 5595
Acer Laboratories, Inc.      www.acerlabs.com   Ali Aladdin V   100 MHz      M1541          M1543C

The following processors are designed for a “Socket 7” bus:

Manufacturer             Website           Chips         clock rate
Advanced Micro Devices   www.amd.com       K6-2          100 MHz
Advanced Micro Devices   www.amd.com       K6-3          100 MHz
Intel                    www.intel.com     Pentium MMX   66 MHz
IDT/Centaur              www.winchip.com   Winchip C6    75 MHz
IDT/Centaur              www.winchip.com   Winchip 2     100 MHz
IDT/Centaur              www.winchip.com   Winchip 2A    100 MHz
IDT/Centaur              www.winchip.com   Winchip 4     100 MHz
NSM/Cyrix                www.cyrix.com

Pinout

In FIG. 57, signals which are different from the Pentium pinout are indicated by italics and underlining. Generally, other Pentium-compatible processors (such as the AMD K6-2) define these signals.

FIG. 48 is a pin summary describing the functions of various pins in accordance with the present embodiment.

Electrical Specifications

FIGS. 49A-G contain electrical specifications describing AC and DC parameters in accordance with the present embodiment. These preliminary electrical specifications provide AC and DC parameters that are required for “Super Socket 7” compatibility.

Bus Control Register

The Bus Control Register provides direct control of Emulator signals, selecting output states and active input states for these signals.

The layout of the Bus Control Register is designed to match the assignment of signals to the Event Register.

number   control
0        Reserved
1        A20M# active level
2        BF0 active level
3        BF1 active level
4        BF2 active level
5        BUSCHK active level
6        FLUSH# active level
7        FRCMC# active level
8        IGNNE# active level
9        INIT active level
10       INTR active level
11       NMI active level
12       SMI# active level
13       STPCLK# active level
14       CPUTYP active at rest
15       DPEN# active at rest
16       FLUSH# active at rest
17       INIT active at rest
31..18   Reserved
32       Bus lock
33       Split cycle
34       BP0 output
35       BP1 output
36       BP2 output
37       BP3 output
38       FERR# output
39       IERR# output
40       PM0 output
41       PM1 output
42       SMIACT# output
63..43   Reserved

Emulator Signals

Several of the signals (A20M#, INIT, NMI, SMI#, STPCLK#, IGNNE#) are inputs that have purposes primarily defined by the needs of x86 processor emulation. They have no direct purpose in the Zeus processor, other than to signal an event, which is handled by software. Each of these signals is an input sampled on the rising edge of each bus clock. If the input signal matches the active level specified in the bus control register, the corresponding bit in the event register is set. The bit in the event register remains set even if the signal is no longer active, until cleared by software. If the event register bit is cleared by software, it is set again on each bus clock that the signal is sampled active.
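The sampling behavior described above can be modeled by a small Python sketch. The class name EmulatorInput and its methods are hypothetical; the sketch models a single input pin, its active-level bit in the bus control register, and its sticky bit in the event register.

```python
class EmulatorInput:
    """Model of one sampled input pin and its sticky event register bit."""

    def __init__(self, active_level: int):
        self.active_level = active_level  # bit in the bus control register
        self.event = 0                    # bit in the event register

    def sample(self, signal: int) -> None:
        # Called once per rising edge of the bus clock.
        if signal == self.active_level:
            self.event = 1                # stays set until software clears it

    def clear(self) -> None:
        # Software clears the event bit; it is set again on the next
        # clock on which the signal is sampled active.
        self.event = 0
```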

A20M#

A20M# (address bit 20 mask inverted), when asserted (low), directs an x86 emulator to generate physical addresses for which bit 20 is zero.

The A20M# bit of the bus control register selects which level of the A20M# signal will generate an event in the A20M# bit of the event register. Clearing (to 0) the A20M# bit of the bus control register will cause the A20M# bit of the event register to be set when the A20M# signal is asserted (low).

Asserting the A20M# signal causes the emulator to modify all current TB mappings to produce a zero value for bit 20 of the byte address. The A20M# bit of the bus control register is then set (to 1) to cause the A20M# bit of the event register to be set when the A20M# signal is released (high).

Releasing the A20M# signal causes the emulator to restore the TB mapping to the original state. The A20M# bit of the bus control register is then cleared (to 0) again, to cause the A20M# bit of the event register to be set when the A20M# signal is asserted (low).
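The alternation between watching for assertion and watching for release can be sketched in Python. This is only an illustration: the function name handle_a20m_event is hypothetical, and the modification of TB mappings is stood in for by a simple address mask, which is an assumption and not how the emulator is stated to work.

```python
BIT20 = 1 << 20


def handle_a20m_event(active_level: int, addr_mask: int):
    """Return the new (active_level, addr_mask) after an A20M# event."""
    if active_level == 0:
        # The event fired on assertion (low): force bit 20 to zero and
        # set the control bit so that the next event fires on release.
        return 1, addr_mask & ~BIT20
    # The event fired on release (high): restore bit 20 and clear the
    # control bit so that the next event fires on assertion.
    return 0, addr_mask | BIT20
```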

INIT

INIT (initialize) when asserted (high), directs an x86 emulator to begin execution of the external ROM BIOS.

The INIT bit of the bus control register is normally set (to 1) to cause the INIT bit of the event register to be set when the INIT signal is asserted (high).

INTR

INTR (maskable interrupt) when asserted (high), directs an x86 emulator to simulate a maskable interrupt by generating two locked interrupt acknowledge special cycles. External hardware will normally release the INTR signal between the first and second interrupt acknowledge special cycle.

The INTR bit of the bus control register is normally set (to 1) to cause the INTR bit of the event register to be set when the INTR signal is asserted (high).

NMI

NMI (non-maskable interrupt) when asserted (high), directs an x86 emulator to simulate a non-maskable interrupt. External hardware will normally release the NMI signal.

The NMI bit of the bus control register is normally set (to 1) to cause the NMI bit of the event register to be set when the NMI signal is asserted (high).

SMI#

SMI# (system management interrupt inverted) when asserted (low), directs an x86 emulator to simulate a system management interrupt by flushing caches and saving registers, and asserting (low) SMIACT# (system management interrupt active inverted). External hardware will normally release the SMI#.

The SMI# bit of the bus control register is normally cleared (to 0) to cause the SMI# bit of the event register to be set when the SMI# signal is asserted (low).

STPCLK#

STPCLK# (stop clock inverted) when asserted (low), directs an x86 emulator to simulate a stop clock interrupt by flushing caches and saving registers, and performing a stop grant special cycle.

The STPCLK# bit of the bus control register is normally cleared (to 0) to cause the STPCLK# bit of the event register to be set when the STPCLK# signal is asserted (low).

Software must set (to 1) the STPCLK# bit of the bus control register to cause the STPCLK# bit of the event register to be set when the STPCLK# signal is released (high) to resume execution. Software must cease producing bus operations after the stop grant special cycle. Usually, software will use the B.HALT instruction in all threads to cease performing operations. The processor PLL continues to operate, and the processor must still sample INIT, INTR, RESET, NMI, SMI# (to place them in the event register) and respond to RESET and inquire and snoop transactions, so long as the bus clock continues operating.

The bus clock itself cannot be stopped until the stop grant special cycle. If the bus clock is stopped, it must stop in the low (0) state. The bus clock must be operating at frequency for at least 1 ms before releasing STPCLK# or releasing RESET. While the bus clock is stopped, the processor does not sample inputs or respond to RESET or inquire or snoop transactions.

External hardware will normally release STPCLK# when it is desired to resume execution. The processor should respond to the STPCLK# bit in the event register by awakening one or more threads.

IGNNE#

IGNNE# (ignore numeric error inverted), when asserted (low), directs an x86 emulator to ignore numeric errors.

The IGNNE# bit of the bus control register selects which level of the IGNNE# signal will generate an event in the IGNNE# bit of the event register. Clearing (to 0) the IGNNE# bit of the bus control register will cause the IGNNE# bit of the event register to be set when the IGNNE# signal is asserted (low).

Asserting the IGNNE# signal causes the emulator to modify its processing to ignore numeric errors, if suitably enabled to do so. The IGNNE# bit of the bus control register is then set (to 1) to cause the IGNNE# bit of the event register to be set when the IGNNE# signal is released (high).

Releasing the IGNNE# signal causes the emulator to restore the emulation to the original state. The IGNNE# bit of the bus control register is then cleared (to 0) again, to cause the IGNNE# bit of the event register to be set when the IGNNE# signal is asserted (low).

Emulator Output Signals

Several of the signals (BP3..BP0, FERR#, IERR#, PM1..PM0, SMIACT#) are outputs that have purposes primarily defined by the needs of x86 processor emulation. They are driven from the bus control register that can be written by software.

Bus Snooping

Zeus supports the “Socket 7” protocols for inquiry, invalidation and coherence of cache lines. The protocols are implemented in hardware and do not interrupt the processor as a result of bus activity. Cache access cycles may be “stolen” for this purpose, which may delay completion of processor memory activity.

Definition

def SnoopPhysicalBus as
  //wait for transaction on bus or inquiry cycle
  do
    wait
  while BRDY# = 0
  pa_(31..3) ← A_(31..3)
  op ← W/R# ? W : R
  cc ← CACHE# ∥ PWT ∥ PCD
enddef

Locked Cycles

Locked cycles occur as a result of synchronization operations (Store-swap instructions) performed by the processor. For x86 emulation, locked cycles also occur as a result of setting specific memory-mapped control registers.

Locked Synchronization Instruction

Bus lock (LOCK#) is asserted (low) automatically as a result of store-swap instructions that generate bus activity, which always perform locked read-modify-write cycles on 64 bits of data. Note that store-swap instructions that are performed on cache sub-blocks that are in the E or M state need not generate bus activity.

Locked Sequences of Bus Transactions

Bus lock (LOCK#) is also asserted (low) on subsequent bus transactions by writing a one (1) to the bus lock bit of the bus control register. Split cycle (SCYC) is similarly asserted (high) if a one (1) is also written to the split cycle bit of the bus emulation control register.

All subsequent bus transactions will be performed as a locked sequence of transactions, asserting bus lock (LOCK# low) and optionally split cycle (SCYC high), until zeroes (0) are written to the bus lock and split cycle bits of the bus control register. The next bus transaction completes the locked sequence, releasing bus lock (LOCK# high) and split cycle (SCYC low) at the end of the transaction. If the locked transaction must be aborted because of bus activity such as backoff, a lock broken event is signalled and the bus lock is released.

Unless special care is taken, the bus transactions of all threads occur as part of the locked sequence of transactions. Software can take such care by interrupting all other threads until the locked sequence is completed. Software should also take care to avoid fetching instructions during the locked sequence, such as by executing instructions out of niche or ROM memory.

Software should also take care to avoid terminating the sequence with event handling prior to releasing the bus lock, such as by executing the sequence with events disabled (other than the lock broken event).

The purpose of this facility is primarily for x86 emulation purposes, in which we are willing to perform acts (such as stopping all the other threads) in the name of compatibility. It is possible to take special care in hardware to sort out the activity of other threads, and break the lock in response to events. In doing so, the bus unit must defer bus activity generated by other threads until the locked sequence is completed. The bus unit should inhibit event handling while the bus is locked.

Sampled at Reset

Certain pins are sampled at reset and made available in the event register.

CPUTYP         Primary or Dual processor
PICD0[DPEN#]   Dual processing enable
FLUSH#         Tristate test mode
INIT           Built-in self-test

Sampled per Clock

Certain pins are sampled per clock and changes are made available in the event register.

A20M#     address bit 20 mask
BF[1:0]   bus frequency
BUSCHK#   bus check
FLUSH#    cache flush request
FRCMC#    functional redundancy check - not implemented on Pentium MMX
IGNNE#    ignore numeric error
INIT      re-initialize pentium processor
INTR      external interrupt
NMI       non-maskable interrupt
R/S#      run/stop
SMI#      system management
STPCLK#   stop clock

Bus Access

The “Socket 7” bus performs transfers of 1-8 bytes within an octlet boundary or 32 bytes on a triclet boundary.

Transfers sized at 16 bytes (hexlet) are not available as a single transaction; they are performed as two bus transactions.

Bus transactions begin by gaining control of the bus (TODO: not shown), and in the initial cycle, asserting ADS#, M/IO#, A, BE#, W/R#, CACHE#, PWT, and PCD. These signals indicate the type, size, and address of the transaction. One or more octlets of data are returned on a read (the external system asserts BRDY# and/or NA# and D), or accepted on a write (TODO: not shown).

The external system is permitted to affect the cacheability and exclusivity of data returned to the processor, using the KEN# and WB/WT# signals.
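The division of a transfer into pieces that each stay within one octlet, together with the byte lanes taking part in each piece, can be sketched in Python. This is a simplified variant for illustration only, not the definition that follows: the function name split_transfer is hypothetical, and the byte-enable mask is reported with a one bit for each participating lane, whereas the BE# pins themselves are active low.

```python
def split_transfer(pa: int, size_bits: int):
    """Split a transfer of 8 to 128 bits into pieces within one octlet each.

    Returns a list of (address, size_bits, lane_mask) tuples, where bit i
    of lane_mask is 1 when byte lane i takes part in that piece.
    """
    pieces = []
    remaining = size_bits // 8      # bytes still to transfer
    addr = pa
    while remaining > 0:
        offset = addr & 7           # byte position within the octlet
        count = min(remaining, 8 - offset)
        lane_mask = ((1 << count) - 1) << offset
        pieces.append((addr, 8 * count, lane_mask))
        addr += count
        remaining -= count
    return pieces
```

A hexlet at an aligned address thus becomes two octlet transactions, and a quadlet starting at byte 6 of an octlet becomes two doublet transactions.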

Definition

def data,cen ← AccessPhysicalBus(pa,size,cc,op,wd) as
  // divide transfers sized between octlet and hexlet into two parts
  // also divide transfers which cross octlet boundary into two parts
  if (64<size≦128) or ((size<64) and (size+8*pa_(2..0)>64)) then
    data0,cen ← AccessPhysicalBus(pa,64-8*pa_(2..0),cc,op,wd)
    if cen=0 then
      pa1 ← pa_(63..4)∥1∥0³
      data1,cen ← AccessPhysicalBus(pa1,size+8*pa_(2..0)-64,cc,op,wd)
      data ← data1_(127..64) ∥ data0_(63..0)
    endif
  else
    ADS# ← 0
    M/IO# ← 1
    A_(31..3) ← pa_(31..3)
    for i ← 0 to 7
      BE_(i)# ← pa_(2..0) ≦ i < pa_(2..0)+size/8
    endfor
    W/R# ← (op = W)
    if (op=R) then
      CACHE# ← [?] (cc ≧ WT)
      PWT ← (cc = WT)
      PCD ← (cc ≦ CD)
      do
        wait
      while (BRDY# = 1) and (NA# = 1)
      //Intel spec doesn't say whether KEN# should be ignored if no CACHE#
      //AMD spec says KEN# should be ignored if no CACHE#
      cen ← [?] KEN# and (cc ≧ WT) //cen=1 if triclet is cacheable
      xen ← WB/WT# and (cc ≠ WT) //xen=1 if triclet is exclusive
      if cen then
        os ← 64*pa_(4..3)
        data_(63+os..os) ← D_(63..0)
        do
          wait
        while BRDY# = 1
        [?]
        do
          wait
        while BRDY# = 1
        [?]
        do
          wait
        while BRDY# = 1
        [?]
      else
        os ← 64*pa₃
        data_(63+os..os) ← D_(63..0)
      endif
    else
      CACHE# ← (size = 256)
      PWT ← (cc = WT)
      PCD ← (cc ≦ CD)
      do
        wait
      while (BRDY# = 1) and (NA# = 1)
      xen ← WB/WT# and (cc ≠ WT)
    endif
  endif
  flags ← cen ∥ xen
enddef

[?] indicates data missing or illegible when filed

Other Bus Cycles

Input/Output transfers, Interrupt acknowledge and special bus cycles (stop grant, flush acknowledge, writeback, halt, flush, shutdown) are performed by uncached loads and stores to a memory-mapped control region.

M/IO#   D/C#   W/R#   CACHE#   KEN#   cycle
0       0      0      1        x      interrupt acknowledge
0       0      1      1        x      special cycles (intel pg 6-33)
0       1      0      1        x      I/O read, 32-bits or less, non-cacheable, 16-bit address
0       1      1      1        x      I/O write, 32-bits or less, non-cacheable, 16-bit address
1       0      x      x        x      code read (not implemented)
1       1      0      1        x      non-cacheable read
1       1      0      x        1      non-cacheable read
1       1      0      0        0      cacheable read
1       1      1      1        x      non-cacheable write
1       1      1      0        x      cache writeback

Special Cycles

An interrupt acknowledge cycle is performed by two byte loads to the control space (dc=1), the first with a byte address (ba) of 4 (A31..3=0, BE4#=0, BE7..5, 3..0#=1), the second with a byte address (ba) of 0 (A31..3=0, BE0#=0, BE7..1#=1). The first byte read is ignored; the second byte contains the interrupt vector. The external system normally releases INTR between the first and second byte load.

A shutdown special cycle is performed by a byte store to the control space (dc=1) with a byte address (ba) of 0 (A31..3=0, BE0#=0, BE7..1#=1).

A flush special cycle is performed by a byte store to the control space (dc=1) with a byte address (ba) of 1 (A31..3=0, BE1#=0, BE7..2, 0#=1).

A halt special cycle is performed by a byte store to the control space (dc=1) with a byte address (ba) of 2 (A31..3=0, BE2#=0, BE7..3, 1..0#=1).

A stop grant special cycle is performed by a byte store to the control space (dc=1) with a byte address (ba) of 0x12 (A31..3=2, BE2#=0, BE7..3, 1..0#=1).

A writeback special cycle is performed by a byte store to the control space (dc=1) with a byte address (ba) of 3 (A31..3=0, BE3#=0, BE7..4, 2..0#=1).

A flush acknowledge special cycle is performed by a byte store to the control space (dc=1) with a byte address (ba) of 4 (A31..3=0, BE4#=0, BE7..5, 3..0#=1).

A back trace message special cycle is performed by a byte store to the control space (dc=1) with a byte address (ba) of 5 (A31..3=0, BE5#=0, BE7..6, 4..0#=1).

Performing load or store operations of other sizes (doublet, quadlet, octlet, hexlet) to the control space (dc=1) or operations with other byte address (ba) values produce bus operations which are not defined by the “Super Socket 7” specifications and have undefined effect on the system.
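The relation between the byte address (ba) of each special cycle and the A31..3 and BE# values listed above can be checked with a short Python sketch. The dictionary SPECIAL_CYCLES and the function encode_byte_cycle are hypothetical names introduced for illustration; the byte addresses are those given in the text.

```python
SPECIAL_CYCLES = {
    "shutdown": 0x00,
    "flush": 0x01,
    "halt": 0x02,
    "writeback": 0x03,
    "flush acknowledge": 0x04,
    "back trace message": 0x05,
    "stop grant": 0x12,
}


def encode_byte_cycle(ba: int):
    """Return (A31..3, BE7..0#) for a one-byte access at byte address ba."""
    address = ba >> 3                           # octlet address bits
    byte_enables_n = 0xFF & ~(1 << (ba & 7))    # active low: one zero bit
    return address, byte_enables_n
```

For the stop grant cycle, encode_byte_cycle(0x12) gives an address field of 2 with only BE2# low, in agreement with the values stated above.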

I/O Cycles

An input cycle is performed by a byte, doublet, or quadlet load to the data space (dc=0), with a byte address (ba) of the I/O address. The address may not be aligned, and if it crosses an octlet boundary, will be performed as two separate cycles.

An output cycle is performed by a byte, doublet, or quadlet store to the data space (dc=0), with a byte address (ba) of the I/O address. The address may not be aligned, and if it crosses an octlet boundary, will be performed as two separate cycles.

Performing load or store operations of other sizes (octlet, hexlet) to the data space (dc=0) produce bus operations which are not defined by the “Super Socket 7” specifications and have undefined effect on the system.

Physical Address

The other bus cycles are accessed explicitly by uncached memory accesses to particular physical address ranges. Appropriately sized load and store operations must be used to perform the specific bus cycles required for proper operations. The dc field must equal 0 for I/O operations, and must equal 1 for control operations. Within this address range, bus transactions are sized no greater than 4 bytes (quadlet) and do not cross quadlet boundaries.

The physical address of an other bus cycle for data/control dc, byte address ba is:

Definition

def data ← AccessPhysicalOtherBus(pa,size,op,wd) as
  // divide transfers sized between octlet and hexlet into two parts
  // also divide transfers which cross octlet boundary into two parts
  if (64<size≦128) or ((size<64) and (size+8*pa_(2..0)>64)) then
    data0 ← AccessPhysicalOtherBus(pa,64-8*pa_(2..0),op,wd)
    pa1 ← pa_(63..4)∥1∥0³
    data1 ← AccessPhysicalOtherBus(pa1,size+8*pa_(2..0)-64,op,wd)
    data ← data1_(127..64) ∥ data0_(63..0)
  else
    ADS# ← 0
    M/IO# ← 0
    D/C# ← ~pa₁₆
    A_(31..3) ← 0¹⁶ ∥ pa_(15..3)
    for i ← 0 to 7
      BE_(i)# ← pa_(2..0) ≦ i < pa_(2..0)+size/8
    endfor
    W/R# ← (op = W)
    CACHE# ← 1
    PWT ← 1
    PCD ← 1
    do
      wait
    while (BRDY# = 1) and (NA# = 1)
    if (op=R) then
      os ← 64*pa₃
      data_(63+os..os) ← D_(63..0)
    endif
  endif
enddef

Events and Threads

Exceptions signal several kinds of events: (1) events that are indicative of failure of the software or hardware, such as arithmetic overflow or parity error, (2) events that are hidden from the virtual process model, such as translation buffer misses, (3) events that infrequently occur, but may require corrective action, such as floating-point underflow. In addition, there are (4) external events that cause scheduling of a computational process, such as clock events or completion of a disk transfer.

Each of these types of events requires the interruption of the current flow of execution, handling of the exception or event, and in some cases, descheduling of the current task and rescheduling of another. The Zeus processor provides a mechanism that is based on the multi-threaded execution model of Mach. Mach divides the well-known UNIX process model into two parts, one called a task, which encompasses the virtual memory space, file and resource state, and the other called a thread, which includes the program counter, stack space, and other register file state. The sum of a Mach task and a Mach thread exactly equals one UNIX process, and the Mach model allows a task to be associated with several threads. On one processor at any one moment in time, at least one task with one thread is running.

In the taxonomy of events described above, the cause of the event may either be synchronous to the currently running thread, generally types 1, 2, and 3, or asynchronous and associated with another task and thread that is not currently running, generally type 4.

For these events, Zeus will suspend the currently running thread in the current task, saving a minimum of registers, and continue execution at a new program counter. The event handler may perform some minimal computation and return, restoring the current thread's registers, or save the remaining registers and switch to a new task or thread context.

Facilities of the exception, memory management, and interface systems are themselves memory mapped, in order to provide for the manipulation of these facilities by high-level language, compiled code. The sole exception is the register file itself, for which standard store and load instructions can save and restore the state.

Definition

def Thread(th) as
  forever
    catch exception
      if ((EventRegister and EventMask[th]) ≠ 0) then
        if ExceptionState=0 then
          raise EventInterrupt
        endif
      endif
      inst ← LoadMemoryX(ProgramCounter,ProgramCounter,32,L)
      Instruction(inst)
    endcatch
    case exception of
      EventInterrupt,
      ReservedInstruction,
      AccessDisallowedByVirtualAddress,
      AccessDisallowedByTag,
      AccessDisallowedByGlobalTB,
      AccessDisallowedByLocalTB,
      AccessDetailRequiredByTag,
      AccessDetailRequiredByGlobalTB,
      AccessDetailRequiredByLocalTB,
      MissInGlobalTB,
      MissInLocalTB,
      FixedPointArithmetic,
      FloatingPointArithmetic,
      GatewayDisallowed:
        case ExceptionState of
          0:
            PerformException(exception)
          1:
            PerformException(SecondException)
          2:
            raise ThirdException
        endcase
      TakenBranch:
        ContinuationState ← (ExceptionState=0) ? 0 : ContinuationState
      TakenBranchContinue:
        /* nothing */
      none, others:
        ProgramCounter ← ProgramCounter + 4
        ContinuationState ← (ExceptionState=0) ? 0 : ContinuationState
    endcase
  endforever
enddef

Definition

def PerformException(exception) as
  v ← (exception > 7) ? 7 : exception
  t ← LoadMemory(ExceptionBase,ExceptionBase+Thread*128+64+8*v,64,L)
  if ExceptionState = 0 then
    u ← RegRead(3,128) ∥ RegRead(2,128) ∥ RegRead(1,128) ∥ RegRead(0,128)
    StoreMemory(ExceptionBase,ExceptionBase+Thread*128,512,L,u)
    RegWrite(0,64,ProgramCounter_(63..2) ∥ PrivilegeLevel)
    RegWrite(1,64,ExceptionBase+Thread*128)
    RegWrite(2,64,exception)
    RegWrite(3,64,FailingAddress)
  endif
  PrivilegeLevel ← t_(1..0)
  ProgramCounter ← t_(63..2) ∥ 0²
  case exception of
    AccessDetailRequiredByTag,
    AccessDetailRequiredByGlobalTB,
    AccessDetailRequiredByLocalTB:
      ContinuationState ← ContinuationState + 1
    others:
      /* nothing */
  endcase
  ExceptionState ← ExceptionState + 1
enddef
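The address arithmetic in PerformException can be restated as a Python sketch. The function names exception_addresses and split_vector are hypothetical; the layout (128 bytes per thread, 64 bytes of register storage followed by eight 8-byte vectors, with the low two bits of each vector holding the privilege level) is read directly from the definition above.

```python
def exception_addresses(exception_base: int, thread: int, exception: int):
    """Return (storage_address, vector_address) for an exception."""
    v = min(exception, 7)            # exceptions above 7 share vector 7
    storage = exception_base + thread * 128
    vector = storage + 64 + 8 * v
    return storage, vector


def split_vector(t: int):
    """Split a 64-bit vector into (privilege level, program counter)."""
    privilege_level = t & 0x3        # t_(1..0)
    program_counter = t & ~0x3       # t_(63..2) catenated with two zero bits
    return privilege_level, program_counter
```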

Definition

def PerformAccessDetail(exception) as
  if (ContinuationState = 0) or (ExceptionState ≠ 0) then
    raise exception
  else
    ContinuationState ← ContinuationState - 1
  endif
enddef

Definition

def BranchBack(rd,rc,rb) as
  c ← RegRead(rc, 64)
  if (rd ≠ 0) or (rc ≠ 0) or (rb ≠ 0) then
    raise ReservedInstruction
  endif
  a ← LoadMemory(ExceptionBase,ExceptionBase+Thread*128,128,L)
  if PrivilegeLevel > c_(1..0) then
    PrivilegeLevel ← c_(1..0)
  endif
  ProgramCounter ← c_(63..2) || 0²
  ExceptionState ← 0
  RegWrite(rd,128,a)
  raise TakenBranchContinue
enddef

Definition

The following data is loaded from memory at the Exception Storage Address:

The following data is loaded from memory at the Exception Vector Address:

The following data replaces the original contents of RF[3..0]:

at: access type: 0=r, 1=w, 2=x, 3=g
as: access size in bytes
TODO: add size, access type to exception data in pseudocode.

Ephemeral Program State

Ephemeral Program State (EPS) is defined as program state which affects the operation of certain instructions, but which does not need to be saved and restored as part of user state.

Because these bits are not saved and restored, the sizes and values described here are not visible to software. The sizes and values described here were chosen to be convenient for the definitions in this documentation. Any mapping of these values which does not alter the functions described may be used in a conforming implementation. For example, either of the EPS states may be implemented as a thermometer-coded vector, or the ContinuationState field may be represented with specific values for each AccessDetailRequired exception which an instruction execution may encounter.

There are eight bits of EPS:

bit#   Name                Meaning
1..0   ExceptionState      0: Normal processing. Asynchronous events and Synchronous exceptions enabled.
                           1: Event/Exception handling: Synchronous exceptions cause SecondException. Asynchronous events are masked.
                           2: Second exception handling: Synchronous exceptions cause machine check. Asynchronous events are masked.
                           3: Illegal state
                           This field is incremented by handling an event or exception, and cleared by the Branch Back instruction.
7..2   ContinuationState   Continuation state for AccessDetailRequired exceptions. A value of zero enables all exceptions of this kind. The value is increased by one for each AccessDetailRequired exception handled, for which that many AccessDetailRequired exceptions are continued past (ignored) on re-execution in normal processing (ex = 0). Any other kind of exception, or the completion of an instruction under normal processing causes the continuation state to be reset to zero. State does not need to be saved on context switch.

The ContinuationState bits are ephemeral because if they are cleared as a result of a context switch, the associated exceptions can happen over again. The AccessDetail exception handlers will then set the bits again, as they were before the context switch. In the case where an AccessDetail exception handler must indicate an error, care must be taken to perform some instruction at the target of the Branch Back instruction by which the exception handler is exited that will operate properly with ContinuationState ≠ 0.
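The counting behavior of ContinuationState can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names, the list-of-booleans representation of an instruction's accesses, and the handler loop are hypothetical and are not part of the architecture definition.

```python
def execute(accesses_need_detail, continuation_state):
    """Attempt one instruction (hypothetical model).

    accesses_need_detail: one boolean per access, True if the detail bit is set.
    Returns ("done", None) or ("exception", index_of_faulting_access).
    """
    continued_past = 0
    for index, needs_detail in enumerate(accesses_need_detail):
        if not needs_detail:
            continue
        if continued_past < continuation_state:
            continued_past += 1          # continued past (ignored) on re-execution
            continue
        return ("exception", index)      # AccessDetailRequired is raised here
    return ("done", None)


def run_with_handler(accesses_need_detail):
    """Re-execute until the instruction completes; the handler allows every access."""
    continuation_state = 0               # zero enables all exceptions of this kind
    handled = []
    while True:
        outcome, index = execute(accesses_need_detail, continuation_state)
        if outcome == "done":
            continuation_state = 0       # reset on completion of the instruction
            return handled, continuation_state
        handled.append(index)
        continuation_state += 1          # increased by one per exception handled
```

In this model an instruction with two detail-marked accesses is executed three times: the first two attempts each raise one exception, and the third completes because both exceptions are continued past.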

The ExceptionState bits are ephemeral because they are explicitly set by event handling and cleared by the termination of event handling, including event handling that results in a context switch.
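The ExceptionState transitions described above can be sketched as a small state machine. The class, method, and exception names below are illustrative assumptions, and delivery of pending events after Branch Back is deliberately left out of the model.

```python
NORMAL, HANDLING, SECOND = 0, 1, 2   # ExceptionState values 0, 1, 2


class MachineCheck(Exception):
    """Hypothetical signal: synchronous exception during second exception handling."""


class ThreadModel:
    """Hypothetical software model of one thread's ExceptionState field."""

    def __init__(self):
        self.exception_state = NORMAL
        self.pending_events = []   # asynchronous events held while masked
        self.log = []              # exception codes dispatched, in order

    def synchronous_exception(self, name):
        if self.exception_state == NORMAL:
            self.log.append(name)
        elif self.exception_state == HANDLING:
            self.log.append("SecondException")
        else:
            raise MachineCheck(name)
        self.exception_state += 1          # incremented by handling an exception

    def asynchronous_event(self, name):
        if self.exception_state == NORMAL:
            self.log.append(name)
            self.exception_state += 1      # incremented by handling an event
        else:
            self.pending_events.append(name)   # masked until Branch Back

    def branch_back(self):
        self.exception_state = NORMAL      # cleared by the Branch Back instruction
```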

Event Register

Events are single-bit messages used to communicate the occurrence of events between threads and interface devices.

The Event Register appears at several locations in memory, with slightly different side effects on read and write operations.

offset  side effect on read                                                             side effect on write
0       none: return event register contents                                            normal: write data into event register
512     return zero value (so read-modify-write for byte/doublet/quadlet store works)   one bits in data set (to one) corresponding event register bits
768     return zero value (so read-modify-write for byte/doublet/quadlet store works)   one bits in data clear (to zero) corresponding event register bits

Physical Address

The Event Register appears at three different locations, for which three functions of the Event Register are performed as described above. The physical address of an Event Register for function f, byte b is:

Definition

def data ← AccessPhysicalEventRegister(pa,op,wdata) as
 f ← pa_(9..8)
 if (pa_(23..10) = 0) and (pa_(7..4) = 0) and (f ≠ 1) then
  case f || op of
   0 || R:
    data ← 0⁶⁴ || EventRegister
   2 || R, 3 || R:
    data ← 0
   0 || W:
    EventRegister ← wdata_(63..0)
   2 || W:
    EventRegister ← EventRegister or wdata_(63..0)
   3 || W:
    EventRegister ← EventRegister and ~wdata_(63..0)
  endcase
 else
  data ← 0
 endif
enddef
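For readers who prefer an executable form, the definition above can be mirrored in a short Python model. This is a sketch only: the class name is hypothetical, and returning 0 from a write is an assumption, since the definition leaves the data result of a write unspecified.

```python
MASK64 = (1 << 64) - 1


class EventRegisterModel:
    """Hypothetical software model of AccessPhysicalEventRegister."""

    def __init__(self):
        self.event_register = 0

    def access(self, pa, op, wdata=0):
        f = (pa >> 8) & 0x3                        # f <- pa_(9..8)
        selected = (((pa >> 10) & 0x3FFF) == 0     # pa_(23..10) = 0
                    and ((pa >> 4) & 0xF) == 0     # pa_(7..4) = 0
                    and f != 1)
        if not selected:
            return 0
        if op == "R":
            # Only function 0 returns the register; functions 2 and 3 read as zero.
            return self.event_register if f == 0 else 0
        w = wdata & MASK64
        if f == 0:
            self.event_register = w                # normal write
        elif f == 2:
            self.event_register |= w               # one bits set event bits
        else:
            self.event_register &= ~w & MASK64     # one bits clear event bits
        return 0                                   # assumption: writes return 0
```

Because the set and clear locations read as zero, a sub-register store implemented as read-modify-write affects only the bits that the store itself supplies.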

Events:

The table below shows the events and their corresponding event number. The priority of these events is soft, in that dispatching from the event register is controlled by software.

TODO notwithstanding the above, using the E.LOGMOST.0 instruction is handy for prioritizing these events, so if you've got a preference as to numbering, speak up!

number  event
0       Clock
1       A20M# active
2       BF0 active
3       BF1 active
4       BF2 active
5       BUSCHK# active
6       FLUSH# active
7       FRCMC# active
8       IGNNE# active
9       INIT active
10      INTR active
11      NMI active
12      SMI# active
13      STPCLK# active
14      CPUTYP active at reset (Primary vs Dual processor)
15      DPEN# active at reset (Dual processing enable - driven low by dual processor)
16      FLUSH# active at reset (tristate test mode)
17      INIT active at reset
18      Bus lock broken
19      BRYRC# active at reset (drive strength)
20

Event Mask

The Event Mask (one per thread) controls whether each of the events described above is permitted to cause an exception in the corresponding thread.

Physical Address

There are as many Event Masks as threads. The physical address of an Event Mask for thread th, byte b is:

Definition

def data ← AccessPhysicalEventMask(pa,op,wdata) as
 th ← pa_(23..19)
 if (th < T) and (pa_(18..4) = 0) then
  case op of
   R:
    data ← 0⁶⁴ || EventMask[th]
   W:
    EventMask[th] ← wdata_(63..0)
  endcase
 else
  data ← 0
 endif
enddef
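The address decoding in the definition above can be sketched as follows. The function names are hypothetical; the base address and the t*2¹⁹ thread stride are taken from the Event Mask row of the physical memory map in this document.

```python
EVENT_MASK_BASE = 0xFFFF_FFFF_0E00_0000   # Event Mask region in the memory map


def event_mask_address(th):
    """Physical address of the Event Mask for thread th (stride 2**19 bytes)."""
    return EVENT_MASK_BASE + (th << 19)


def decode_event_mask_address(pa, num_threads):
    """Return the thread selected by pa, or None if no Event Mask is selected."""
    th = (pa >> 19) & 0x1F                 # th <- pa_(23..19)
    if th < num_threads and ((pa >> 4) & 0x7FFF) == 0:   # pa_(18..4) = 0
        return th
    return None
```

With T = 4 threads, for example, only thread numbers 0 through 3 decode to a mask; any other address in the region reads as zero under the definition above.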

Exceptions:

The table below shows the exceptions, the corresponding exception number, and the parameter supplied to the exception handler in register 3.

number  exception                         parameter (register 3)
0       EventInterrupt
1       MissInGlobalTB                    global address
2       AccessDetailRequiredByTag         global address
3       AccessDetailRequiredByGlobalTB    global address
4       AccessDetailRequiredByLocalTB     local address
5
6       SecondException
7       ReservedInstruction               instruction
8       AccessDisallowedByVirtualAddress  local address
9       AccessDisallowedByTag             global address
10      AccessDisallowedByGlobalTB        global address
11      AccessDisallowedByLocalTB         local address
12      MissInLocalTB                     local address
13      FixedPointArithmetic              instruction
14      FloatingPointArithmetic           instruction
15      GatewayDisallowed                 none
16 17 18 19 20 21 22 23 24 25 TakenBranch TakenBranchContinue

The GlobalTBMiss Handler

The GlobalTBMiss exception occurs when a load, store, or instruction fetch is attempted while none of the GlobalTB entries contain a matching virtual address. The Zeus processor uses a fast software-based exception handler to fill in a missing GlobalTB entry.

There are several possible ways that software may maintain page tables. For purposes of this discussion, it is assumed that a virtual page table is maintained, in which a 128-bit GTB value for each 4 Kbyte page is kept in a linear table which is itself in virtual memory. By maintaining the page table in virtual memory, very large virtual spaces may be managed without keeping a large amount of physical memory dedicated to page tables.

Because the page table is kept in virtual memory, it is possible that a valid reference may cause a second GTBMiss exception if the virtual address that contains the page table is not present in the GTB. The processor is designed to permit a second exception to occur within an exception handler, causing a branch to the SecondException handler. However, to simplify the hardware involved, a SecondException exception saves no specific information about the exception; handling depends on keeping enough relevant information in registers to recover from the second exception.

Zeus is a multithreaded processor, which creates some special considerations in the exception handler. Unlike a single-threaded processor, it is possible that multiple threads may nearly simultaneously reference the same page and invoke two or more GTB misses, and the fully-associative construction of the GTB requires that there be no more than one matching entry for each global virtual address. Zeus provides a search-and-insert operation (GTBUpdateFill) to simplify the handling of the GTB. This operation also uses hardware GTB pointer registers to select GTB entries for replacement in FIFO priority.

A further problem is that software may need to modify the protection information contained in the GTB, such as to remove read and/or write access to a page in order to infer which parts of memory are in use, or to remove pages from a task. These modifications may occur concurrently with the GTBMiss handler, so software must take care to properly synchronize these operations. Zeus provides a search-and-update operation (GTBUpdate) to simplify updating GTB entries.

When a large number of page table entries must be changed, noting the limited capacity of the GTB can reduce the work. Reading the GTB can be less work than matching all modified entries against the GTB contents. To facilitate this, Zeus also provides read access to the hardware GTB pointers to further permit scanning the GTB for entries which have been replaced since a previous scan. GTB pointer wraparound is also logged, so it can be determined that the entire GTB needs to be scanned if all entries have been replaced since a previous scan.

In the code below, offsets from r1 are used with the following data structure:

Offset      Meaning
0..15       r0 save
16..31      r1 save
32..47      r2 save
48..63      r3 save
512..527    r4 save
528..535    BasePT
536..543    GTBUpdateFill
544..559    DummyPT
560..639    available 96 bytes

On a GTBMiss, the handler retrieves a base address for the virtual page table and constructs an index by shifting away the page offset bits of the virtual address. A single 128-bit indexed load retrieves the new GTB entry directly (except that a virtual page table miss causes a second exception, handled below). A single 128-bit store to the GTBUpdateFill location places the entry into the GTB, after checking to ensure that a concurrent handler has not already placed the entry into the GTB.

Code for GlobalTBMiss:

li64la r2=r1,BasePT //base address for page table
ashri r3@12 //4k pages
l128la r3=r2,r3 //retrieve page table, SecExc if bad va
2: li64la r2=r1,GTBUpdateFill //pointer to GTB update location
si128la r3,r2,0 //save new TB entry
li128la r3=r1,48 //restore r3
li128la r2=r1,32 //restore r2
li128la r1=r1,16 //restore r1
bback //restore r0 and return
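The index computation performed by the first three instructions of the handler can be illustrated as below. This is a sketch under assumptions that this section does not confirm: the indexed load is taken to scale the index by the 16-byte operand size, the address is treated as non-negative, and the function name is hypothetical.

```python
PAGE_SHIFT = 12      # 4k pages: shift away the page offset bits (ashri r3@12)
ENTRY_BYTES = 16     # one 128-bit GTB value per page (assumed index scaling)


def page_table_entry_address(base_pt, global_address):
    """Virtual address of the page table entry for global_address (hypothetical)."""
    index = global_address >> PAGE_SHIFT
    return base_pt + index * ENTRY_BYTES
```

The shift discards the page offset, which is why the SecondException handler discussed next cannot recover the full faulting address from r3 alone.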

A second exception occurs on a virtual page table miss. It is possible to service such a page table miss directly; however, the page offset bits of the virtual address have been shifted away, and have been lost. These bits can be recovered: in such a case, a dummy GTB entry is constructed, which will cause an exception other than GTBMiss upon returning. A re-execution of the offending code will then invoke a more extensive handler, making the full virtual address available.

For purposes of this example, it is assumed that checking the contents of r2 against the contents of BasePT is a good way to ensure that the second exception handler was entered from the GlobalTBMiss handler.

Code for SecondException:

si128la r4,r1,512 //save r4
li64la r4=r1,BasePT //base address for page table
bne r2,r4,1f //did we lose at page table load?
li128la r2=r1,DummyPT //dummy page table, shifted left 64-12 bits
xshlmi128 r3@r2,64+12 //combine page number with dummy entry
li128la r4=r1,512 //restore r4
b 2b //fall back into GTB Miss handler
1:

Exceptions in Detail

There are no special registers to indicate details about the exception, such as the virtual address at which an access was attempted, or the operands of a floating-point operation that results in an exception. Instead, this information is available via general-purpose registers or registers stored in memory.

When a synchronous exception or asynchronous event occurs, the original contents of registers 0..3 are saved in memory and replaced with (0) program counter, privilege level, and ephemeral program state, (1) event data pointer, (2) exception code, and (3) when applicable, failing address or instruction. A new program counter and privilege level are loaded from memory and execution begins at the new address. After handling the exception and restoring all but one register, a branch-back instruction restores the final register and resumes execution.

During exception handling, any asynchronous events are kept pending until a BranchBack instruction is performed. By this mechanism, we can handle exceptions and events one at a time, without the need to interrupt and stack exceptions. Software should take care to avoid keeping the handling of asynchronous events pending for too long.

When a second exception occurs in a thread which is handling an exception, all the above operations occur, except for the saving and replacing of registers 0..3 in memory. A distinct exception code SecondException replaces the normal exception code. By this mechanism, a fast exception handler for GlobalTBMiss can be written, in which a second GlobalTBMiss or FixedPointOverflow exception may safely occur.

When a third exception occurs in a thread which is handling an exception, an immediate transfer of control occurs to the machine check vector address, with information about the exception available in the machine check cause field of the status register. The transfer of control may overwrite state that may be necessary to recover from the exception; the intent is to provide a satisfactory post-mortem indication of the characteristics of the failure.

This section describes in detail the conditions under which exceptions occur, the parameters passed to the exception handler, and the handling of the result of the procedure.

Reserved Instruction

The ReservedInstruction exception occurs when an instruction code which is reserved for future definition as part of the Zeus architecture is executed.

Register 3 contains the 32-bit instruction.

Access Disallowed by Virtual Address

This exception occurs when a load, store, branch, or gateway refers to an aligned memory operand with an improperly aligned address, or if architecture description parameter LB=1, may also occur if the add or increment of the base register or program counter which generates the address changes the unmasked upper 16 bits of the local address.

Register 3 contains the local address to which the access was attempted.

Access Disallowed by Tag

This exception occurs when a read (load), write (store), execute, or gateway attempts to access a virtual address for which the matching cache tag entry does not permit this access.

Register 3 contains the global address to which the access was attempted.

Access Detail Required by Tag

This exception occurs when a read (load), write (store), or execute attempts to access a virtual address for which the matching virtual cache entry would permit this access, but the detail bit is set.

Register 3 contains the global address to which the access was attempted.

Description

The exception handler should determine accessibility. If the access should be allowed, the continuepastdetail bit is set and execution returns. Upon return, execution is restarted and the access will be retried. Even if the detail bit is set in the matching virtual cache entry, access will be permitted.

Access Disallowed by Global TB

This exception occurs when a read (load), write (store), execute, or gateway attempts to access a virtual address for which the matching global TB entry does not permit this access.

Register 3 contains the global address to which the access was attempted.

Description

The exception handler should determine accessibility, modify the virtual memory state if desired, and return if the access should be allowed. Upon return, execution is restarted and the access will be retried.

Access Detail Required by Global TB

This exception occurs when a read (load), write (store), execute, or gateway attempts to access a virtual address for which the matching global TB entry would permit this access, but the detail bit in the global TB entry is set.

Register 3 contains the global address to which the access was attempted.

Description

The exception handler should determine accessibility and return if the access should be allowed. Upon return, execution is restarted and the access will be allowed. If the access is not to be allowed, the handler should not return.

Global TB Miss

This exception occurs when a read (load), write (store), execute, or gateway attempts to access a virtual address for which no global TB entry matches.

Register 3 contains the global address to which the access was attempted.

Description

The exception handler should load a global TB entry that defines the translation and protection for this address. Upon return, execution is restarted and the global TB access will be attempted again.

Access Disallowed by Local TB

This exception occurs when a read (load), write (store), execute, or gateway attempts to access a virtual address for which the matching local TB entry does not permit this access.

Register 3 contains the local address to which the access was attempted.

Description

The exception handler should determine accessibility, modify the virtual memory state if desired, and return if the access should be allowed. Upon return, execution is restarted and the access will be retried.

Access Detail Required by Local TB

This exception occurs when a read (load), write (store), execute, or gateway attempts to access a virtual address for which the matching local TB entry would permit this access, but the detail bit in the local TB entry is set.

Register 3 contains the local address to which the access was attempted.

Description

The exception handler should determine accessibility and return if the access should be allowed. Upon return, execution is restarted and the access will be allowed. If the access is not to be allowed, the handler should not return.

Local TB Miss

This exception occurs when a read (load), write (store), execute, or gateway attempts to access a virtual address for which no local TB entry matches.

Register 3 contains the local address to which the access was attempted.

Description

The exception handler should load a local TB entry that defines the translation and protection for this address. Upon return, execution is restarted and the local TB access will be attempted again.

Floating-Point Arithmetic

Register 3 contains the 32-bit instruction.

Description

The address of the instruction that was the cause of the exception is passed as the contents of register 0. The exception handler should attempt to perform the function specified in the instruction and service any exceptional conditions that occur.

Fixed-Point Arithmetic

Register 3 contains the 32-bit instruction.

Description

The address of the instruction which was the cause of the exception is passed as the contents of register 0. The exception handler should attempt to perform the function specified in the instruction and service any exceptional conditions that occur.

Reset and Error Recovery

Certain external and internal events cause the processor to invoke reset or error recovery operations. These operations consist of a full or partial reset of critical machine state, including initialization of the threads to begin fetching instructions from the start vector address. Software may determine the nature of the reset or error by reading the value of the control register, in which finding the reset bit set (1) indicates that a reset has occurred, and finding the reset bit cleared (0) indicates that a machine check has occurred. When either a reset or machine check has been indicated, the contents of the status register contain more detailed information on the cause.

Definition

def PerformMachineCheck(cause) as
 ResetVirtualMemory( )
 ProgramCounter ← StartVectorAddress
 PrivilegeLevel ← 3
 StatusRegister ← cause
enddef

Reset

A reset may be caused by a power-on reset, a bus reset, a write of the control register which sets the reset bit, or internally detected errors including meltdown detection and double check.

A reset causes the processor to set the configuration to minimum power and low clock speed, note the cause of the reset in the status register, stabilize the phase locked loops, disable the MMU from the control register, and initialize all threads to begin execution at the start vector address.

Other system state is left undefined by reset and must be explicitly initialized by software; this explicitly includes the thread register state, LTB and GTB state, superspring state, and external interface devices. The code at the start vector address is responsible for initializing these remaining system facilities, and reading further bootstrap code from an external ROM.

Power-On Reset

A reset occurs upon initial power-on. The cause of the reset is noted by initializing the Status Register and other registers to the reset values noted below.

Bus Reset

A reset occurs upon observing that the RESET signal has been active. The cause of the reset is noted by initializing the Status Register and other registers to the reset values noted below.

Control Register Reset

A reset occurs upon writing a one to the reset bit of the Control Register. The cause of the reset is noted by initializing the Status Register and other registers to the reset values noted below.

Meltdown Detected Reset

A reset occurs if the temperature is above the threshold set by the meltdown margin field of the configuration register. The cause of the reset is noted by setting the meltdown detected bit of the Status Register.

Double Check Reset

A reset occurs if a second machine check occurs that prevents recovery from the first machine check. Specifically, the occurrence of an exception in the event thread, a watchdog timer error, or a bus error while any machine check cause bit is still set in the Status Register results in a double machine check reset. The cause of the reset is noted by setting the double check bit of the Status Register.

Machine Check

Detected hardware errors, such as communications errors in the bus, a watchdog timeout error, or internal cache parity errors, invoke a machine check. A machine check will disable the MMU, to translate all local virtual addresses to equal physical addresses, note the cause of the exception in the Status Register, and transfer control of all threads to the start vector address. This action is similar to that of a reset, but differs in that the configuration settings and thread state are preserved.

Recovery from machine checks depends on the severity of the error and the potential loss of information as a direct cause of the error. The start vector address is designed to reach internal ROM memory, so that operation of machine check diagnostic and recovery code need not depend on proper operation or contents of any external device. The program counter and register file state of the thread prior to the machine check is lost (except for the portion of the program counter saved in the Status Register), so diagnostic and recovery code must not assume that the register file state is indicative of the prior operating state of the thread. The state of the thread is frozen similarly to that of an exception.

Machine check diagnostic code determines the cause of the machine check from the processor's Status Register, and as required, the status and other registers of external bus devices.

Recovery code will generally consume enough time that real-time interface performance targets may have been missed. Consequently, the machine check recovery software may need to repair further damage, such as interface buffer underruns and overruns as may have occurred during the intervening time.

This final recovery code, which re-initializes the state of the interface system and recovers a functional event thread state, may return to using the complete machine resources, as the condition which caused the machine check will have been resolved.

The following table lists the causes of machine check errors.

machine check errors
Parity or uncorrectable error in on-chip cache
Parity or communications error in system bus
Event Thread exception
Watchdog timer

Parity or Uncorrectable Error in Cache

When a parity or uncorrectable error occurs in an on-chip cache, such an error is generally non-recoverable. These errors are non-recoverable because the data in such caches may reside anywhere in memory, and because the data in such caches may be the only up-to-date copy of that memory contents. Consequently, the entire contents of the memory store are lost, and the severity of the error is high enough to consider such a condition to be a system failure.

The machine check provides an opportunity to report such an error before shutting down a system for repairs.

There are specific means by which a system may recover from such an error without failure, such as by restarting from a system-level checkpoint, from which a consistent memory state can be recovered.

Parity or Communications Error in Bus

When a parity or communications error occurs in the system bus, such an error may be partially recoverable.

Bits corresponding to the affected bus operation are set in the processor's Status Register. Recovery software should determine which devices are affected, by querying the Status Register of each device on the affected MediaChannel channels.

A bus timeout may result from normal self-configuration activities.

If the error is simply a communications error, resetting appropriate devices and restarting tasks may recover from the error. Read and write transactions may have been underway at the time of a machine check and may or may not be reflected in the current system state.

If the error is from a parity error in memory, the contents of the affected area of memory are lost, and consequently the tasks associated with that memory must generally be aborted, or resumed from a task-level checkpoint. If the contents of the affected memory can be recovered from mass storage, a complete recovery is possible.

If the affected memory is that of a critical part of the operating system, such a condition is considered a system failure, unless recovery can be accomplished from a system-level checkpoint.

Watchdog Timeout Error

A watchdog timeout error indicates a general software or hardware failure. Such an error is generally treated as non-recoverable and fatal.

Event Thread Exception

When an event thread suffers an exception, the cause of the exception and a portion of the virtual address at which the exception occurred are noted in the Status Register. Because under normal circumstances, the event thread should be designed not to encounter exceptions, such exceptions are treated as non-recoverable, fatal errors.

Reset State

A reset or machine check causes the Zeus processor to stabilize the phase locked loops, disable the local and global TB, to translate all local virtual addresses to equal physical addresses, and initialize all threads to begin execution at the start vector address.

Start Address

The start address is used to initialize the threads with a program counter upon a reset or machine check. These causes of such initialization can be differentiated by the contents of the Status Register.

The start address is a virtual address which, when “translated” by the local TB and global TB to a physical address, is designed to access the internal ROM code. The internal ROM space is chosen to minimize the number of internal resources and interfaces that must be operated to begin execution or recover from a machine check.

Virtual/physical address    description
0xFFFF FFFF FFFF FFFC       start vector address

Definition

def StartProcessor as
 forever
  catch check
   EnableWatchdog ← 0
   fork RunClock
   ControlRegister₆₂ ← 0
   for th ← 0 to T-1
    ProgramCounter[th] ← 0xFFFF FFFF FFFF FFFC
    PrivilegeLevel[th] ← 3
    fork Thread(th)
   endfor
  endcatch
  kill RunClock
  for th ← 0 to T-1
   kill Thread(th)
  endfor
  PerformMachineCheck(check)
 endforever
enddef

def PerformMachineCheck(check) as
 case check of
  ClockWatchdog:
  CacheError:
  ThirdException:
 endcase
enddef

Internal ROM Code

Zeus internal ROM code performs reset initialization of on-chip resources, including the LZC and LOC, followed by self-testing. The BIOS ROM should be scanned for a special prefix that indicates that Zeus native code is present in the ROM, in which case the ROM code is executed directly; otherwise execution of a BIOS-level x86 emulator is begun.

Memory and Devices

Physical Memory Map

Zeus defines a 64-bit physical address, but while residing in a S7 pin-out, can address a maximum of 4 Gbytes of main memory. In other packages the core Zeus design can provide up to 64-bit external physical address spaces. Bits 63..32 of the physical address distinguish between internal (on-chip) physical addresses, where bits 63..32 = FFFFFFFF, and external (off-chip) physical addresses, where bits 63..32 ≠ FFFFFFFF.
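The internal/external distinction can be expressed as a one-line test on the upper half of the address. The function name below is hypothetical and the snippet is only an illustration of the rule stated above.

```python
def is_internal(pa):
    """True if bits 63..32 of the 64-bit physical address equal FFFFFFFF."""
    return ((pa >> 32) & 0xFFFF_FFFF) == 0xFFFF_FFFF
```

Under this rule the start vector address 0xFFFF FFFF FFFF FFFC is internal, while any address below 0xFFFF FFFF 0000 0000 is external.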

Address range                                  bytes          Meaning
0000 0000 0000 0000..0000 0000 FFFF FFFF       4G             External Memory
0000 0001 0000 0000..FFFF FFFE FFFF FFFF       16E−8G         External Memory expansion
FFFF FFFF 0000 0000..FFFF FFFF 0002 0FFF       128K+4K        Level One Cache
FFFF FFFF 0002 1000..FFFF FFFF 08FF FFFF       144M−132K      Level One Cache expansion
FFFF FFFF 0900 0000..FFFF FFFF 0900 007F       128            Level One Cache redundancy
FFFF FFFF 0900 0080..FFFF FFFF 09FF FFFF       16M−128        LOC redundancy expansion
FFFF FFFF 0A00 0000+t*2¹⁹+e*16                 8*T*2^(LE)     LTB thread t entry e
FFFF FFFF 0A00 0000..FFFF FFFF 0AFF FFFF       8*T*2^(LE)     LTB max 8*T*2^(LE) = 16M bytes
FFFF FFFF 0B00 0000..FFFF FFFF 0BFF FFFF       16M            Special Bus Operations
FFFF FFFF 0C00 0000+t_(5..GT)*2^(19+GT)+e*16   T2^(4+GE−GT)   GTB thread entry e
FFFF FFFF 0C00 0000..FFFF FFFF 0CFF FFFF       T2^(4+GE−GT)   GTB max 2⁵⁺⁴⁺¹⁵ = 16M bytes
FFFF FFFF 0D00 0000+t_(5..GT)*2^(19+GT)        16*T*2^(−GT)   GTBUpdate thread t
FFFF FFFF 0D00 0100+t_(5..GT)*2^(19+GT)        16*T*2^(−GT)   GTBUpdateFill thread t
FFFF FFFF 0D00 0200+t_(5..GT)*2^(19+GT)        8*T*2^(−GT)    GTBLast thread t
FFFF FFFF 0D00 0300+t_(5..GT)*2^(19+GT)        8*T*2^(−GT)    GTBFirst thread t
FFFF FFFF 0D00 0400+t_(5..GT)*2^(19+GT)        8*T*2^(−GT)    GTBBump thread t
FFFF FFFF 0E00 0000+t*2¹⁹                      8T             Event Mask thread t
FFFF FFFF 0F00 0008..FFFF FFFF 0F00 00FF       256-8          Reserved
FFFF FFFF 0F00 0100..FFFF FFFF 0F00 0107       8
FFFF FFFF 0F00 0108..FFFF FFFF 0F00 01FF       256-8          Reserved
FFFF FFFF 0F00 0200..FFFF FFFF 0F00 0207       8              Event Register bit set
FFFF FFFF 0F00 0208..FFFF FFFF 0F00 02FF       256-8          Reserved
FFFF FFFF 0F00 0300..FFFF FFFF 0F00 0307       8              Event Register bit clear
FFFF FFFF 0F00 0308..FFFF FFFF 0F00 03FF       256-8          Reserved
FFFF FFFF 0F00 0400..FFFF FFFF 0F00 0407       8              Clock Cycle
FFFF FFFF 0F00 0408..FFFF FFFF 0F00 04FF       256-8          Reserved
FFFF FFFF 0F00 0500..FFFF FFFF 0F00 0507       8              Thread
FFFF FFFF 0F00 0508..FFFF FFFF 0F00 05FF       256-8          Reserved
FFFF FFFF 0F00 0600..FFFF FFFF 0F00 0607       8              Clock Event
FFFF FFFF 0F00 0608..FFFF FFFF 0F00 06FF       256-8          Reserved
FFFF FFFF 0F00 0700..FFFF FFFF 0F00 0707       8              Clock Watchdog
FFFF FFFF 0F00 0708..FFFF FFFF 0F00 07FF       256-8          Reserved
FFFF FFFF 0F00 0800..FFFF FFFF 0F00 0807       8              Tally Counter 0
FFFF FFFF 0F00 0808..FFFF FFFF 0F00 08FF       256-8          Reserved
FFFF FFFF 0F00 0900..FFFF FFFF 0F00 0907       8              Tally Control 0
FFFF FFFF 0F00 0908..FFFF FFFF 0F00 09FF       256-8          Reserved
FFFF FFFF 0F00 0A00..FFFF FFFF 0F00 0A07       8              Tally Counter 1
FFFF FFFF 0F00 0A08..FFFF FFFF 0F00 0AFF       256-8          Reserved
FFFF FFFF 0F00 0B00..FFFF FFFF 0F00 0B07       8              Tally Control 1
FFFF FFFF 0F00 0B08..FFFF FFFF 0F00 0BFF       256-8          Reserved
FFFF FFFF 0F00 0C00..FFFF FFFF 0F00 0C07       8              Exception Base
FFFF FFFF 0F00 0C08..FFFF FFFF 0F00 0CFF       256-8          Reserved
FFFF FFFF 0F00 0D00..FFFF FFFF 0F00 0D07       8              Bus Control Register
FFFF FFFF 0F00 0D08..FFFF FFFF 0F00 0DFF       256-8          Reserved
FFFF FFFF 0F00 0E00..FFFF FFFF 0F00 0E07       8              Status Register
FFFF FFFF 0F00 0E08..FFFF FFFF 0F00 0EFF       256-8          Reserved
FFFF FFFF 0F00 0F00..FFFF FFFF 0F00 0F07       8              Control Register
FFFF FFFF 0F00 0F08..FFFF FFFF FEFF FFFF                      Reserved
FFFF FFFF FF00 0000..FFFF FFFF FFFE FFFF       16M−65k        Internal ROM expansion
FFFF FFFF FFFF 0000..FFFF FFFF FFFF FFFF       64k            Internal ROM

letter  name   2^(x)  “binary”                        10^(y)  “decimal”
b       bits
B       bytes
               0      1                               0       1
K       kilo   10     1 024                           3       1 000
M       mega   20     1 048 576                       6       1 000 000
G       giga   30     1 073 741 824                   9       1 000 000 000
T       tera   40     1 099 511 627 776               12      1 000 000 000 000
P       peta   50     1 125 899 906 842 624           15      1 000 000 000 000 000
E       exa    60     1 152 921 504 606 846 976       18      1 000 000 000 000 000 000

Definition

def data ← ReadPhysical(pa,size) as
  data,flags ← AccessPhysical(pa,size,WA,R,0)
enddef

def WritePhysical(pa,size,wdata) as
  data,flags ← AccessPhysical(pa,size,WA,W,wdata)
enddef

def data,flags ← AccessPhysical(pa,size,cc,op,wdata) as
  if (0x0000000000000000 ≦ pa ≦ 0x00000000FFFFFFFF) then
    data,flags ← AccessPhysicalBus(pa,size,cc,op,wdata)
  else
    data ← AccessPhysicalDevices(pa,size,op,wdata)
    flags ← 1
  endif
enddef

def data ← AccessPhysicalDevices(pa,size,op,wdata) as
  if (size=256) then
    data0 ← AccessPhysicalDevices(pa,128,op,wdata_(127..0))
    data1 ← AccessPhysicalDevices(pa+16,128,op,wdata_(255..128))
    data ← data1 || data0
  elseif (0xFFFFFFFF0B000000 ≦ pa ≦ 0xFFFFFFFF0BFFFFFF) then
    //don't perform RMW on this region
    data ← AccessPhysicalOtherBus(pa,size,op,wdata)
  elseif (op=W) and (size<128) then
    //this code should change to check pa4..0≠0 and size<sizeofreg
    rdata ← AccessPhysicalDevices(pa and ~15,128,R,0)
    bs ← 8*(pa and 15)
    be ← bs + size
    hdata ← rdata_(127..be) || wdata_(be−1..bs) || rdata_(bs−1..0)
    data ← AccessPhysicalDevices(pa and ~15,128,W,hdata)
  elseif (0x0000000100000000 ≦ pa ≦ 0xFFFFFFFEFFFFFFFF) then
    data ← 0
  elseif (0xFFFFFFFF00000000 ≦ pa ≦ 0xFFFFFFFF08FFFFFF) then
    data ← AccessPhysicalLOC(pa,op,wdata)
  elseif (0xFFFFFFFF09000000 ≦ pa ≦ 0xFFFFFFFF09FFFFFF) then
    data ← AccessPhysicalLOCRedundancy(pa,op,wdata)
  elseif (0xFFFFFFFF0A000000 ≦ pa ≦ 0xFFFFFFFF0AFFFFFF) then
    data ← AccessPhysicalLTB(pa,op,wdata)
  elseif (0xFFFFFFFF0C000000 ≦ pa ≦ 0xFFFFFFFF0CFFFFFF) then
    data ← AccessPhysicalGTB(pa,op,wdata)
  elseif (0xFFFFFFFF0D000000 ≦ pa ≦ 0xFFFFFFFF0DFFFFFF) then
    data ← AccessPhysicalGTBRegisters(pa,op,wdata)
  elseif (0xFFFFFFFF0E000000 ≦ pa ≦ 0xFFFFFFFF0EFFFFFF) then
    data ← AccessPhysicalEventMask(pa,op,wdata)
  elseif (0xFFFFFFFF0F000000 ≦ pa ≦ 0xFFFFFFFF0FFFFFFF) then
    data ← AccessPhysicalSpecialRegisters(pa,op,wdata)
  elseif (0xFFFFFFFF10000000 ≦ pa ≦ 0xFFFFFFFFFEFFFFFF) then
    data ← 0
  elseif (0xFFFFFFFFFF000000 ≦ pa ≦ 0xFFFFFFFFFFFFFFFF) then
    data ← AccessPhysicalROM(pa,op,wdata)
  endif
enddef

def data ← AccessPhysicalSpecialRegisters(pa,op,wdata) as
  if (pa_(7..0) ≧ 0x10) then
    data ← 0
  elseif (0xFFFFFFFF0F000000 ≦ pa ≦ 0xFFFFFFFF0F0003FF) then
    data ← AccessPhysicalEventRegister(pa,op,wdata)
  elseif (0xFFFFFFFF0F000500 ≦ pa ≦ 0xFFFFFFFF0F0005FF) then
    data ← AccessPhysicalThread(pa,op,wdata)
  elseif (0xFFFFFFFF0F000400 ≦ pa ≦ 0xFFFFFFFF0F0007FF) then
    data ← AccessPhysicalClock(pa,op,wdata)
  elseif (0xFFFFFFFF0F000800 ≦ pa ≦ 0xFFFFFFFF0F000BFF) then
    data ← AccessPhysicalTally(pa,op,wdata)
  elseif (0xFFFFFFFF0F000C00 ≦ pa ≦ 0xFFFFFFFF0F000CFF) then
    data ← AccessPhysicalExceptionBase(pa,op,wdata)
  elseif (0xFFFFFFFF0F000D00 ≦ pa ≦ 0xFFFFFFFF0F000DFF) then
    data ← AccessPhysicalBusControl(pa,op,wdata)
  elseif (0xFFFFFFFF0F000E00 ≦ pa ≦ 0xFFFFFFFF0F000EFF) then
    data ← AccessPhysicalStatus(pa,op,wdata)
  elseif (0xFFFFFFFF0F000F00 ≦ pa ≦ 0xFFFFFFFF0F000FFF) then
    data ← AccessPhysicalControl(pa,op,wdata)
  endif
enddef
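The read-modify-write branch of AccessPhysicalDevices (a write narrower than 128 bits) can be illustrated with a minimal Python sketch. The function name merge_partial_write and the boundary check are assumptions made for illustration and do not appear in the definition above; the sketch also assumes that wdata is already positioned in its byte lanes, as the bit ranges in the pseudocode suggest.

```python
LANE_BITS = 128  # width of one device access, per the pseudocode


def merge_partial_write(rdata: int, wdata: int, pa: int, size: int) -> int:
    """Return the 128-bit value written back for a store narrower than 128 bits.

    rdata: the 128 bits currently held at the aligned address (pa and ~15)
    wdata: 128-bit write data, assumed already positioned in its byte lanes
    pa:    physical address of the store
    size:  store width in bits
    """
    bs = 8 * (pa & 15)   # first bit of the field being written
    be = bs + size       # one past the last bit of the field
    if be > LANE_BITS:
        # Hypothetical guard: the pseudocode does not say what happens here.
        raise ValueError("store crosses the 128-bit boundary")
    mask = ((1 << size) - 1) << bs
    # Same result as rdata[127..be] || wdata[be-1..bs] || rdata[bs-1..0]
    return (rdata & ~mask) | (wdata & mask)
```

Only the bits between bs and be are taken from the write data; every other bit of the aligned 128-bit word keeps its previously read value.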

Architecture Description Register

The last hexlet of the internal ROM contains data that describes implementation-dependent choices within the architecture specification. The last quadlet of the internal ROM contains a branch-immediate instruction, so the architecture description is limited to 96 bits.

Address range                               bytes   Meaning
FFFF FFFF FFFF FFFC..FFFF FFFF FFFF FFFF    4       Reset address
FFFF FFFF FFFF FFF0..FFFF FFFF FFFF FFFB    12      Architecture Description Register

The table below indicates the detailed layout of the Architecture Description Register.

bits       field name   value   range    interpretation
127..96    bi start                      Contains a branch instruction for bootstrap from internal ROM
95..23     0            0       0        reserved
22..21     GT           1       0..3     log₂ threads which share a global TB
20..17     GE           7       0..15    log₂ entries in global TB
16         LB           1       0..1     local TB based on base register
15..14     LE           1       0..3     log₂ entries in local TB (per thread)
13         CT           1       0..1     dedicated tags in first-level cache
12..10     CS           2       0..7     log₂ cache blocks in first-level cache set
9..5       CE           9       0..31    log₂ cache blocks in first-level cache
4..0       T            4       1..31    number of execution threads

The architecture description register contains a machine-readable version of the architecture framework parameters: T, CE, CS, CT, LE, GE, and GT described in the Architectural Framework section on page 25.
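As an illustration of how software might read these parameters, the following Python sketch extracts each field from the low-order bits of the register using the bit positions given in the layout table. The names ADR_FIELDS, decode_adr and encode_adr are hypothetical helpers, not part of the architecture.

```python
# (high bit, low bit) of each field, taken from the layout table
ADR_FIELDS = {
    "GT": (22, 21),
    "GE": (20, 17),
    "LB": (16, 16),
    "LE": (15, 14),
    "CT": (13, 13),
    "CS": (12, 10),
    "CE": (9, 5),
    "T": (4, 0),
}


def decode_adr(value: int) -> dict:
    """Split the low 23 bits of the register into named fields."""
    return {
        name: (value >> lo) & ((1 << (hi - lo + 1)) - 1)
        for name, (hi, lo) in ADR_FIELDS.items()
    }


def encode_adr(**fields: int) -> int:
    """Pack named fields back into a register value (for testing only)."""
    value = 0
    for name, field_value in fields.items():
        hi, lo = ADR_FIELDS[name]
        if field_value >> (hi - lo + 1):
            raise ValueError(f"{name} does not fit in {hi - lo + 1} bits")
        value |= field_value << lo
    return value
```

Packing the values listed in the table (GT=1, GE=7, LB=1, LE=1, CT=1, CS=2, CE=9, T=4) with encode_adr and passing the result to decode_adr returns the same field values.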

Status Register

The status register is a 64-bit register with both read and write access, though the only legal value which may be written is a zero, to clear the register. The result of writing a non-zero value is not specified.

bits      field name                      value   range     interpretation
63        power-on                        1       0..1      This bit is set when a power-on reset has caused a reset.
62        internal reset                  0       0..1      This bit is set when writing to the control register caused a reset.
61        bus reset                       0       0..1      This bit is set when a bus reset has caused a reset.
60        double check                    0       0..1      This bit is set when a double machine check has caused a reset.
59        meltdown                        0       0..1      This bit is set when a meltdown detector has caused a reset.
58..56    0                               0*      0         Reserved for other machine check causes.
55        event exception                 0       0..1      This bit is set when an exception in event thread has caused a machine check.
54        watchdog timeout                0       0..1      This bit is set when a watchdog timeout has caused a machine check.
53        bus error                       0       0..1      This bit is set when a bus error has caused a machine check.
52        cache error                     0       0..1      This bit is set when a cache error has caused a machine check.
51        vm error                        0       0..1      This bit is set when a virtual memory error has caused a machine check.
50..48    0                               0*      0         Reserved for other machine check causes.
47..32    machine check detail            0*      0..4095   Set to exception code if exception in event thread. Set to bus error code if bus error.
31..0     machine check program counter   0       0         Set to indicate bits 31..0 of the value of the thread 0 program counter at the initiation of a machine check.

The power-on bit of the status register is set upon the completion of a power-on reset.

The bus reset bit of the status register is set upon the completion of a bus reset initiated by the RESET pin of the Socket 7 interface.

The double check bit of the status register is set when a second machine check occurs that prevents recovery from the first machine check, or which is indicative of machine check recovery software failure. Specifically, the occurrence of an event exception, watchdog timeout, bus error, or meltdown while any reset or machine check cause bit of the status register is still set results in a double check reset.
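The double check condition can be sketched in Python as follows. This is an illustrative reading of the paragraph above, not a definition from the specification: it assumes that the "reset or machine check cause bits" are bits 63..59 and 55..51 of the status register table, and the function name is_double_check and the cause strings are hypothetical.

```python
# Bit positions taken from the status register table
RESET_CAUSE_BITS = (63, 62, 61, 60, 59)
MACHINE_CHECK_CAUSE_BITS = (55, 54, 53, 52, 51)

# Causes that the text says escalate to a double check reset
DOUBLE_CHECK_TRIGGERS = frozenset(
    {"event exception", "watchdog timeout", "bus error", "meltdown"}
)

CAUSE_MASK = sum(1 << bit for bit in RESET_CAUSE_BITS + MACHINE_CHECK_CAUSE_BITS)


def is_double_check(status: int, new_cause: str) -> bool:
    """True if new_cause arrives while an earlier cause bit is still set."""
    return new_cause in DOUBLE_CHECK_TRIGGERS and (status & CAUSE_MASK) != 0
```

Under this reading, recovery software avoids a double check reset by clearing the status register (writing zero) once it has handled the first cause.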

The meltdown bit of the status register is set when the meltdown detector has discovered an on-chip temperature above the threshold set by the meltdown threshold field of the control register, which causes a reset to occur.

The event exception bit of the status register is set when an event thread suffers an exception, which causes a machine check. The exception code is loaded into the machine check detail field of the status register, and the machine check program counter is loaded with the low-order 32 bits of the program counter and privilege level.

The watchdog timeout bit of the status register is set when the watchdog timer register is equal to the clock cycle register, causing a machine check.

The bus error bit of the status register is set when a bus transaction error (bus timeout, invalid transaction code, invalid address, parity errors) has caused a machine check.

The cache error bit of the status register is set when a cache error, such as a cache parity error, has caused a machine check.

The vm error bit of the status register is set when a virtual memory error, such as a GTB multiple-entry selection error, has caused a machine check.

The machine check detail field of the status register is set when a machine check has been completed. For an exception in event thread, the value indicates the type of exception for which the most recent machine check has been reported. For a bus error, this field may indicate additional detail on the cause of the bus error. For a cache error, this field may indicate the address at which the cache parity error was detected.

The machine check program counter field of the status register is loaded with bits 31..0 of the program counter and privilege level at which the most recent machine check has occurred. The value in this field provides a limited diagnostic capability for purposes of software development, or possibly for error recovery.

Physical Address

The physical address of the Status Register, byte b is:

Definition

def data ← AccessPhysicalStatus(pa,op,wdata) as
  case op of
    R:
      data ← 0⁶⁴ || StatusRegister
    W:
      StatusRegister ← wdata_(63..0)
  endcase
enddef

Control Register

The control register is a 64-bit register with both read and write access. It is altered only by write access to this register.

bits      field name      value   range     interpretation
63        reset           0       0..1      set to invoke internal reset
62        MMU             0       0..1      set to enable the MMU
61        LOC parity      0       0..1      set to enable LOC parity
60        meltdown        0       0..1      set to enable meltdown detector
59..57    LOC timing      0       0..7      adjust LOC timing (0 = slow .. 7 = fast)
56..55    LOC stress      0       0..3      adjust LOC stress (0 = normal)
54..52    clock timing    0       0..7      adjust clock timing (0 = slow .. 7 = fast)
51..12    0               0       0         Reserved
11..8     global access   0*      0..15     global access
7..0      niche limit     0*      0..127    niche limit

The reset bit of the control register provides the ability to reset an individual Zeus device in a system. Writing a one (1) to this bit is equivalent to a power-on reset or a bus reset.

The duration of the reset is sufficient for the operating state changes to have taken effect. At the completion of the reset operation, the internal reset bit of the status register is set and the reset bit of the control register is cleared (0).

The MMU bit of the control register provides the ability to enable or disable the MMU features of the Zeus processor. Writing a zero (0) to this bit disables the MMU, causing all MMU-related exceptions to be disabled and causing all load, store, program and gateway virtual addresses to be treated as physical addresses. Writing a one (1) to this bit enables the MMU and MMU-related exceptions. On a reset or machine check, this bit is cleared (0), thus disabling the MMU.

The parity bit of the control register provides the ability to enable or disable the cache parity feature of the Zeus processor. Writing a zero (0) to this bit disables the parity check, causing the parity check machine check to be disabled. Writing a one (1) to this bit enables the cache parity machine check. On a reset or machine check, this bit is cleared (0), thus disabling the cache parity check.

The meltdown bit of the control register provides the ability to enable or disable the meltdown detection feature of the Zeus processor. Writing a zero (0) to this bit disables the meltdown detector, causing the meltdown detected machine check to be disabled. Writing a one (1) to this bit enables the meltdown detector. On a reset or machine check, this bit is cleared (0), thus disabling the meltdown detector.

The LOC timing bits of the control register provide the ability to adjust the cache timing of the Zeus processor. Writing a zero (0) to this field sets the cache timing to its slowest state, enhancing reliability but limiting clock rate. Writing a seven (7) to this field sets the cache timing to its fastest state, limiting reliability but enhancing performance. On a reset or machine check, this field is cleared (0), thus providing operation at low clock rate. Changing this register should be performed when the cache is not actively being operated.

The LOC stress bits of the control register provide the ability to stress the LOC parameters by adjusting voltage levels within the LOC. Writing a zero (0) to this field sets the cache parameters to their normal state, enhancing reliability. Writing a non-zero value (1, 2, or 3) to this field sets the cache parameters to levels at which cache reliability is slightly compromised.

The stressed parameters are used to cause LOC cells with marginal performance to fail during self-test, so that redundancy can be employed to enhance reliability. On a reset or machine check, this field is cleared (0), thus providing operation at normal parameters. Changing this register should be performed when the cache is not actively being operated.

The clock timing bits of the control register provide the ability to adjust the clock timing of the Zeus processor. Writing a zero (0) to this field sets the clock timing to its slowest state, enhancing reliability but limiting clock rate. Writing a seven (7) to this field sets the clock timing to its fastest state, limiting reliability but enhancing performance. On a power-on reset, bus reset, or machine check, this field is cleared (0), thus providing operation at low clock rate. The internal clock rate is set to (clock timing+1)/2*(external clock rate). Changing this register should be performed along with a control register reset.
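The clock rate relation can be worked through with a short Python sketch. It assumes the expression is read as ((clock timing + 1) / 2) multiplied by the external clock rate; the function name internal_clock_hz and the example frequency are illustrative only.

```python
from fractions import Fraction


def internal_clock_hz(clock_timing: int, external_hz: int) -> Fraction:
    """Internal clock rate for a 3-bit clock timing field value (0..7)."""
    if not 0 <= clock_timing <= 7:
        raise ValueError("clock timing is a 3-bit field")
    # Assumed reading of the formula: ((clock timing + 1) / 2) * external rate
    return Fraction(clock_timing + 1, 2) * external_hz
```

Under that reading, a field value of 0 runs the core at half the external clock rate and a field value of 7 runs it at four times the external clock rate.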

The global access bits of the control register determine whether a local TB miss causes an exception or is treated as a global address. A single bit, selected by the privilege level active for the access from the four-bit configuration register field “Global Access” (GA), determines the result. If GA_(PL) is zero (0), the failure causes an exception; if it is one (1), the failure causes the address to be used as a global address directly.
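A minimal Python sketch of this selection follows. It assumes that privilege level PL selects bit PL of the four-bit field, with bit 0 corresponding to privilege level 0; the function name and the returned strings are hypothetical.

```python
def local_tb_miss_action(global_access: int, privilege_level: int) -> str:
    """Decide what a local TB miss does at the given privilege level.

    global_access:   the 4-bit GA field of the control register
    privilege_level: privilege level active for the access (0..3)
    """
    if not 0 <= privilege_level <= 3:
        raise ValueError("privilege level must be in 0..3")
    ga_bit = (global_access >> privilege_level) & 1  # assumed bit ordering
    return "use as global address" if ga_bit else "exception"
```

For example, with GA = 0b1000 only accesses at privilege level 3 fall through to the global address space; a local TB miss at any other level raises an exception.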

The niche limit bits of the control register determine which cache lines are used for cache access, and which lines are used for niche access. For addresses pa_(14..8) < nl, a 7-bit address modifier register am is inclusive-or'ed against pa_(14..8) to determine the cache line. The cache modifier am must be set to (1^(7−log(128−nl)) || 0^(log(128−nl))) for proper operation. The am value does not appear in a register and is generated from the nl value.
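The address modifier can be illustrated with the following Python sketch. It rests on several assumptions that the text does not state outright: the logarithm is taken base 2, 128 minus the niche limit is a power of two, and the second bit-string term is a run of zeros. The function names address_modifier and cache_line are hypothetical.

```python
def address_modifier(niche_limit: int) -> int:
    """Build the 7-bit modifier: ones in the high bits, zeros in the low bits."""
    span = 128 - niche_limit
    if span <= 0 or span & (span - 1):
        raise ValueError("sketch assumes 128 - niche_limit is a power of two")
    zeros = span.bit_length() - 1   # log2(128 - niche_limit)
    ones = 7 - zeros
    return ((1 << ones) - 1) << zeros


def cache_line(pa: int, niche_limit: int) -> int:
    """Cache line selected by address bits 14..8, after applying the modifier."""
    line = (pa >> 8) & 0x7F
    if line < niche_limit:
        line |= address_modifier(niche_limit)
    return line
```

With a niche limit of 64 the modifier is binary 1000000, so cache accesses whose bits 14..8 fall below 64 are redirected to lines 64 and above, leaving the lower lines for niche access.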

Physical Address

The physical address of the Control Register, byte b is:

Definition

def data ← AccessPhysicalControl(pa,op,wdata) as
  case op of
    R:
      data ← 0⁶⁴ || ControlRegister
    W:
      ControlRegister ← wdata_(63..0)
  endcase
enddef

Clock

The Zeus processor provides internal clock facilities using three registers: a clock cycle register that increments by one every cycle, a clock event register that sets the clock bit in the event register, and a clock watchdog register that invokes a clock watchdog machine check. These registers are memory mapped.

Clock Cycle

Each Zeus processor includes a clock that maintains processor-clock-cycle accuracy. The value of the clock cycle register is incremented on every cycle, regardless of the number of instructions executed on that cycle. The clock cycle register is 64 bits long.

For testing purposes the clock cycle register is both readable and writable, though in normal operation it should be written only at system initialization time; there is no mechanism provided for adjusting the value in the clock cycle counter without the possibility of losing cycles.

Clock Event

An event is asserted when the value in the clock cycle register is equal to the value in the clock event register, which sets the clock bit in the event register.

It is required that a sufficient number of bits be implemented in the clock event register so that the comparison with the clock cycle register overflows no more frequently than once per second. 32 bits is sufficient for a 4 GHz clock. The remaining unimplemented bits must be zero whenever read, and ignored on write. Equality is checked only against bits that are implemented in both the clock cycle and clock event registers.
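The 32-bit figure follows from simple arithmetic: a counter of n bits wraps every 2^n cycles, and 2^32 cycles at 4 GHz take slightly more than one second. The Python sketch below computes the smallest width that satisfies the once-per-second requirement; the function name min_counter_bits is illustrative.

```python
def min_counter_bits(clock_hz: int) -> int:
    """Smallest n such that 2**n cycles last at least one second."""
    if clock_hz < 1:
        raise ValueError("clock rate must be positive")
    # (clock_hz - 1).bit_length() equals ceil(log2(clock_hz)) for clock_hz >= 1
    return max(1, (clock_hz - 1).bit_length())


def wrap_period_seconds(bits: int, clock_hz: int) -> float:
    """Time for a counter of the given width to wrap at the given clock rate."""
    return (1 << bits) / clock_hz
```

For a 4 GHz clock the minimum is 32 bits, and the resulting wrap period is 2^32 / (4 × 10^9) seconds, which is about 1.07 seconds.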

For testing purposes the clock event register is both readable and writable, though in normal operation it is typically only written.

Clock Watchdog

A Machine Check is asserted when the value in the clock cycle register is equal to the value in the clock watchdog register, which sets the watchdog timeout bit in the status register.

A Machine Check or a Reset, of any cause including a clock watchdog, disables the clock watchdog machine check. A write to the clock watchdog register enables the clock watchdog machine check.

It is required that a sufficient number of bits be implemented in the clock watchdog register so that the comparison with the clock cycle register overflows no more frequently than once per second. 32 bits is sufficient for a 4 GHz clock. The remaining unimplemented bits must be zero whenever read, and ignored on write. Equality is checked only against bits that are implemented in both the clock cycle and clock watchdog registers.

The clock watchdog register is both readable and writable, though in normal operation it is written periodically with a sufficiently large value that the register does not equal the value in the clock cycle register before the next time it is written.

Physical Address

The Clock registers appear at three different locations, for which three registers of the Clock are mapped. The Clock Cycle counter is register 0, the Clock Event is register 2, and the Clock Watchdog is register 3. The physical address of a Clock Register f, byte b is:

Definition

def data ← AccessPhysicalClock(pa,op,wdata) as
  f ← pa_(9..8)
  case f || op of
    0 || R:
      data ← 0⁶⁴ || ClockCycle
    0 || W:
      ClockCycle ← wdata_(63..0)
    2 || R:
      data ← 0⁹⁶ || ClockEvent
    2 || W:
      ClockEvent ← wdata_(31..0)
    3 || R:
      data ← 0⁹⁶ || ClockWatchdog
    3 || W:
      ClockWatchdog ← wdata_(31..0)
      EnableWatchdog ← 1
  endcase
enddef

def RunClock as
  forever
    ClockCycle ← ClockCycle + 1
    if EnableWatchdog and (ClockCycle_(31..0) = ClockWatchdog_(31..0)) then
      raise ClockWatchdogMachineCheck
    elseif (ClockCycle_(31..0) = ClockEvent_(31..0)) then
      EventRegister₀ ← 1
    endif
    wait
  endforever
enddef
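The behavior of RunClock can be approximated by a small Python model that advances one cycle per call. The class Clock and its method names are hypothetical; the model returns a string where the pseudocode raises a machine check, and it disables the watchdog after it fires, following the statement that any machine check disables the clock watchdog.

```python
from typing import Optional

MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


class Clock:
    """Software model of the clock cycle, event and watchdog registers."""

    def __init__(self) -> None:
        self.cycle = 0
        self.event = 0
        self.watchdog = 0
        self.watchdog_enabled = False
        self.event_bit = 0  # stands in for bit 0 of the event register

    def write_event(self, value: int) -> None:
        self.event = value & MASK32

    def write_watchdog(self, value: int) -> None:
        # A write to the watchdog register also enables the machine check.
        self.watchdog = value & MASK32
        self.watchdog_enabled = True

    def tick(self) -> Optional[str]:
        """Advance one cycle; return the machine check name if one fires."""
        self.cycle = (self.cycle + 1) & MASK64
        low = self.cycle & MASK32
        if self.watchdog_enabled and low == self.watchdog:
            self.watchdog_enabled = False
            return "ClockWatchdogMachineCheck"
        if low == self.event:
            self.event_bit = 1
        return None
```

As in the pseudocode, the watchdog comparison takes priority: if both registers match on the same cycle, only the machine check is reported.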

Tally

Tally Counter

Each processor includes two counters that can tally processor-related events or operations. The values of the tally counter registers are incremented on each processor clock cycle in which specified events or operations occur. The tally counter registers do not signal events.

It is required that a sufficient number of bits be implemented so that the tally counter registers overflow no more frequently than once per second. 32 bits is sufficient for a 4 GHz clock. The remaining unimplemented bits must be zero whenever read, and ignored on write.

For testing purposes each of the tally counter registers is both readable and writable, though in normal operation each should be written only at system initialization time; there is no mechanism provided for adjusting the value in the tally counter registers without the possibility of losing counts.

Physical Address

The Tally Counter registers appear at two different locations, for which the two registers are mapped. The physical address of a Tally Counter register f, byte b is:

Tally Control

The tally counter control registers each select one metric for one of the tally counters.

Each control register is loaded with a value in one of the following formats:

flag    meaning
0       count instructions issued
1       count instructions retired (differs by branch mispred, exceptions)
2       count cycles in which at least one instruction is issued
3       count cycles in which next instruction is waiting for issue

W E X G S L B A: include instructions of these classes

flag    meaning
0       count bytes transferred cache/buffer to/from processor
1       count bytes transferred memory to/from cache/buffer
2
3
4       count cache hits
5       count cycles in which at least one cache hit occurs
6       count cache misses
7       count cycles in which at least one cache miss occurs
8..15

S L A W I: include instructions of these classes (Store, Load, Wide, Instruction fetch)

flag    meaning
0       count cycles in which a new instruction is issued
1       count cycles in which an execution unit is busy
2
3       count cycles in which an instruction is waiting for issue

n: select unit number for G or A unit

E X T G A: include instructions of these classes (Ensemble, Crossbar, Translate, Group, Address)

event: select event number from event register

Other valid values for the tally control fields are given by the following table:

other   meaning
0       count number of instructions waiting to issue each cycle
1       count number of instructions waiting in spring each cycle
2..63   Reserved

tally control field interpretation

Physical Address

The Tally Control registers appear at two different locations, for which the two registers are mapped. The physical address of a Tally Control register f, byte b is:

Definition

def data ← AccessPhysicalTally(pa,op,wdata) as
  f ← pa₉
  case pa₈ || op of
    0 || R:
      data ← 0⁹⁶ || TallyCounter[f]
    0 || W:
      TallyCounter[f] ← wdata_(31..0)
    1 || R:
      data ← 0¹¹² || TallyControl[f]
    1 || W:
      TallyControl[f] ← wdata_(15..0)
  endcase
enddef

Thread Register

The Zeus processor includes a register that effectively contains the number of the thread that reads the register. In this way, threads running identical code can discover their own identity.

It is required that a sufficient number of bits be implemented so that each thread receives a distinct value. Values must be consecutive, unsigned and include a zero value. The remaining unimplemented bits must be zero whenever read. Writes to this register are ignored.

Physical Address

The physical address of the Thread Register, byte b is:

Definition

def data ← AccessPhysicalThread(pa,op,wdata) as
  case op of
    R:
      data ← 0⁶⁴ || Thread
    W:
      // nothing
  endcase
enddef

High-Level Language Accessibility

In one embodiment of the invention, all processor, memory, and interface resources are directly accessible to high-level language programs. In one embodiment, memory is byte-addressed, using either little-endian or big-endian byte ordering. In one embodiment, for consistency with the bit ordering, and for compatibility with x86 processors, little-endian byte ordering is used when an ordering must be selected. In one embodiment, load and store instructions are available for both little-endian and big-endian byte ordering. In one embodiment, interface resources are accessible as memory-mapped registers. In one embodiment, system state is memory mapped, so that it can be manipulated by compiled code.
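The difference between the two byte orderings can be shown with a brief Python sketch that models a store and a load of a multi-byte value. The helper names store_bytes and load_bytes are illustrative and do not correspond to instruction mnemonics.

```python
def store_bytes(value: int, size: int, little_endian: bool = True) -> bytes:
    """Byte sequence written to memory for an unsigned value of size bytes."""
    return value.to_bytes(size, "little" if little_endian else "big")


def load_bytes(data: bytes, little_endian: bool = True) -> int:
    """Unsigned value read back from a byte sequence in memory."""
    return int.from_bytes(data, "little" if little_endian else "big")
```

Storing the quadlet 0x11223344 in little-endian order places the byte 0x44 at the lowest address, while big-endian order places 0x11 there; loading with the same ordering used for the store recovers the original value.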

In one embodiment, instructions are specified to assemblers and other code tools in the syntax of an instruction mnemonic (operation code), then optionally white space followed by a list of operands. In one embodiment, instruction mnemonics listed in this specification are in upper case (capital) letters; assemblers accept either upper case or lower case letters in the instruction mnemonics. In this specification, instruction mnemonics contain periods (“.”) to separate elements to make them easier to understand; assemblers ignore periods within instruction mnemonics.

In FIGS. 31B, 31D, 32B, 33B, 34B, 35B, 36B, 38B, 38E, 38H, 39B, 39F, 40B, 41B, 42B, 43B, 43F, 43I, 43L, 44A, 44F, 45B, 45H, 46B, 47A, 51B, 52B, 53B, 58B, 59B, and 60B-106B, the format of instructions to be presented to an assembler is illustrated. Following the assembler format, the format for inclusion of instructions into high-level compiled languages is indicated. Finally, the detailed structure of the instruction fields, including pseudo code used to connect the assembler and compiled formats to the instruction fields, is shown. Further detailed explanation of the formats and instruction decoding is provided in the section titled “Instruction Set.”

In one embodiment, an instruction is specifically defined as a four-byte structure with the little-endian ordering. In one embodiment, instructions must be aligned on four-byte boundaries. In one embodiment, basic floating-point operations supported in hardware are floating-point add, subtract, multiply, divide, square root and conversions among floating-point formats and between floating-point and binary integer formats. Software libraries provide other operations required by the ANSI/IEEE floating-point standard.

In one embodiment, software conventions are employed at software module boundaries, in order to permit the combination of separately compiled code and to provide standard interfaces between application, library and system software. In one embodiment, register usage and procedure call conventions may be modified, simplified or optimized when a single compilation encloses procedures within a compilation unit so that the procedures have no external interfaces. For example, internal procedures may permit a greater number of register-passed parameters, or have registers allocated to avoid the need to save registers at procedure boundaries, or may use a single stack or data pointer allocation to suffice for more than one level of procedure call.

In one embodiment, at a procedure call boundary, registers are saved either by the caller or callee procedure, which provides a mechanism for leaf procedures to avoid needing to save registers. Compilers may choose to allocate variables into caller or callee saved registers depending on how their lifetimes overlap with procedure calls.

In one embodiment, procedure parameters are normally allocated in registers, starting from register 2 up to register 9. These registers hold up to 8 parameters, which may each be of any size from one byte to sixteen bytes (hexlet), including floating-point and small structure parameters. Additional parameters are passed in memory, allocated on the stack. For C procedures which use varargs.h or stdarg.h and pass parameters to further procedures, the compilers must leave room in the stack memory allocation to save registers 2 through 9 into memory contiguously with the additional stack memory parameters, so that procedures such as doprnt can refer to the parameters as an array. Procedure return values are also allocated in registers, starting from register 2 up to register 9. Larger values are passed in memory, allocated on the stack.
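The parameter allocation rule can be sketched in Python as follows. The function name allocate_parameters, the tuple representation, and the use of an ordinal stack slot rather than a byte offset are simplifications assumed for illustration; the text does not specify stack alignment, and parameters larger than a hexlet are rejected here only because the text does not describe how they are passed.

```python
from typing import List, Tuple

FIRST_PARAM_REG = 2
LAST_PARAM_REG = 9
MAX_PARAM_BYTES = 16  # one hexlet


def allocate_parameters(sizes: List[int]) -> List[Tuple[str, int]]:
    """Assign each parameter, in order, to a register or a stack slot.

    sizes: size in bytes of each parameter
    Returns ("register", register number) or ("stack", slot index) per parameter.
    """
    placements: List[Tuple[str, int]] = []
    next_reg = FIRST_PARAM_REG
    stack_slot = 0
    for size in sizes:
        if not 1 <= size <= MAX_PARAM_BYTES:
            raise ValueError("sketch handles parameters of 1 to 16 bytes only")
        if next_reg <= LAST_PARAM_REG:
            placements.append(("register", next_reg))
            next_reg += 1
        else:
            placements.append(("stack", stack_slot))
            stack_slot += 1
    return placements
```

A call with ten parameters places the first eight in registers 2 through 9 and the remaining two in consecutive stack slots.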

In one embodiment, instruction scheduling is performed by a compiler. In the manner of software pipelining, instructions should generally be scheduled so that previous operations can be completed at the time of issue. When this is not possible, the processor inserts sufficient empty cycles to perform the instructions precisely; explicit no-operation instructions are not required.

CONCLUSION

Having fully described various embodiments of the invention, those skilled in the art will recognize, given the teachings herein, that numerous alternatives and equivalents exist which do not depart from the invention. It is therefore intended that the invention not be limited by the foregoing description, but only by the appended claims.

1-22. (canceled)
23. A programmable processor comprising: an instruction path; a data path; an external interface operable to receive data from an external source and communicate the received data over the data path; a cache operable to retain data communicated between the external interface and the data path; a register file operable to receive and store data from the data path and communicate the stored data to the data path; and an execution unit coupled to the instruction path and the data path and operable to: decode a single instruction for selectively arranging data, specifying a data selection operand and a first and a second register each having a register width, the single instruction independently specifying the first register and the second register, the first and second registers providing a plurality of data elements each having an elemental width smaller than the register width, the data selection operand comprising a plurality of fields each selecting any one of the plurality of data elements and each field having a value not restricted by the other fields included in the data selection operand; and provide in parallel the data elements selected by the fields to respective predetermined positions in a catenated result, wherein the predetermined positions are in the same order as the fields of the data selection operand.
24. The processor of claim 23 wherein each field of the data selection operand provides a sufficient number of bits to specify any one of the plurality of data elements.
25. The processor of claim 24 wherein each field of the data selection operand has a width of n bits, wherein the plurality of data elements comprises 2ⁿ data elements.
26. The processor of claim 23 wherein the data selection operand is provided by a register specified by the single instruction.
27. The processor of claim 26 wherein the data selection operand has a width equal to the specified register width.
28. The processor of claim 23 wherein the catenated result is provided to a register.
29. The processor of claim 23 wherein the plurality of data elements has a combined width equal to the width of the first register plus the width of the second register.
30. The processor of claim 23 wherein the instruction further specifies a data element width of the plurality of data elements.
31. The processor of claim 23 wherein each data element has a width of 8 bits.
32. The processor of claim 23 wherein the catenated result has a width of 128 bits.
33. The processor of claim 23 wherein for each field of the data selection operand, a relative location of the field within the data selection operand corresponds to a relative location of the predetermined position within the catenated result.
34. The processor of claim 23 wherein the execution unit is further operable to: decode a second single instruction specifying a third and a fourth register each containing a plurality of floating-point operands; multiply the plurality of floating-point operands in the third register by the plurality of floating-point operands in the fourth register to produce a plurality of products; and provide the plurality of products to partitioned fields of a result register as a catenated result.
35. A data processing system comprising: (a) a bus coupling components in the data processing system; (b) an external memory coupled to the bus; (c) a programmable microprocessor coupled to the bus and capable of operation independent of another host processor, the microprocessor comprising: an instruction path; a data path; an external interface operable to receive data from an external source and communicate the received data over the data path; a cache operable to retain data communicated between the external interface and the data path; a register file operable to receive and store data from the data path and communicate the stored data to the data path; and an execution unit coupled to the instruction path and the data path and operable to: decode a single instruction for selectively arranging data, specifying a data selection operand and a first and a second register each having a register width, the single instruction independently specifying the first register and the second register, the first and second registers providing a plurality of data elements each having an elemental width smaller than the register width, the data selection operand comprising a plurality of fields each selecting any one of the plurality of data elements and each field having a value not restricted by the other fields included in the data selection operand; and provide in parallel the data elements selected by the fields to respective predetermined positions in a catenated result, wherein the predetermined positions are in the same order as the fields of the data selection operand.
36. The system of claim 35 wherein each field of the data selection operand provides a sufficient number of bits to specify any one of the plurality of data elements.
37. The system of claim 36 wherein each field of the data selection operand has a width of n bits, wherein the plurality of data elements comprises 2ⁿ data elements.
38. The system of claim 35 wherein the data selection operand is provided by a register specified by the single instruction.
39. The system of claim 38 wherein the data selection operand has a width equal to the specified register width.
40. The system of claim 35 wherein the catenated result is provided to a register.
41. The system of claim 35 wherein the plurality of data elements has a combined width equal to the width of the first register plus the width of the second register.
42. The system of claim 35 wherein the instruction further specifies a data element width of the plurality of data elements.
43. The system of claim 35 wherein each data element has a width of 8 bits.
44. The system of claim 35 wherein the catenated result has a width of 128 bits.
45. The system of claim 35 wherein for each field of the data selection operand, a relative location of the field within the data selection operand corresponds to a relative location of the predetermined position within the catenated result.
46. The system of claim 35 wherein the execution unit is further operable to: decode a second single instruction specifying a third and a fourth register each containing a plurality of floating-point operands; multiply the plurality of floating-point operands in the third register by the plurality of floating-point operands in the fourth register to produce a plurality of products; and provide the plurality of products to partitioned fields of a result register as a catenated result.
 47. A programmable processor comprising:an instruction path; a data path; a plurality of registers operable toreceive and store data from the data path and communicate the storeddata to the data path; and an execution unit coupled to the instructionpath and the data path and operable to: decode a single instructionspecifying a plurality of registers each having a register width, theplurality of registers independently specified by the single instructionand storing a plurality of data elements each having an elemental widthsmaller than the register width, an index register storing an indexvector comprising a plurality of indices stored in partitioned fields ofthe index register and a destination register; wherein each index in theindex vector comprises a sufficient number of bits to represent a rangeof possible index values, the range of possible index values including adifferent index value for each of the plurality of data elements storedin the plurality of registers, allowing the index to select any dataelement from the plurality of data elements stored in the plurality ofregisters; wherein each index in the index vector has a value notrestricted by the other indices in the index vector; and provide inparallel the data elements selected by the indices to respectivepredetermined positions in the destination register, wherein thepredetermined positions are in the same order as the indices stored inthe partitioned fields of the index register.
 48. The processor set forth in claim 47 wherein the plurality of registers comprises two registers.
 49. The processor set forth in claim 47 wherein the number of indices stored in the index register is equal to the number of predetermined positions in the destination register.
 50. The processor set forth in claim 47 wherein the index vector comprises n equal-sized indices and the destination register comprises n equal-sized predetermined positions.
 51. The processor set forth in claim 50 wherein the index stored in a lowest order set of bits of the index register provides a data element to a lowest order set of bits of the destination register, the index in a second lowest order set of bits of the index register provide a data element to a second lowest order set of bits of the destination register and the index stored in a highest order set of bits of the index register provides a data element to a highest order set of bits of the destination register.
 52. The processor set forth in claim 47 wherein the destination register is a 128-bit register.
 53. A programmable processor comprising: an instruction path; a data path; an external interface operable to receive data from an external source and communicate the received data over the data path; a cache operable to retain data communicated between the external interface and the data path; a plurality of registers operable to receive and store data from the data path and communicate the stored data to the data path; and an execution unit coupled to the instruction path and the data path and operable to: decode a single instruction specifying a first register storing a first plurality of data elements, a second register storing a second plurality of data elements, an index register storing an index vector comprising a plurality of indices stored in partitioned fields of the index register and a destination register; wherein the single instruction independently specifies the first register and the second register; wherein each of the first and second registers has a register width, and each of the first and second plurality of data elements has an elemental width smaller than the register width; wherein each index in the index vector comprises a sufficient number of bits to represent a range of possible index values, the range of possible index values including a different index value for each of the first and second pluralities of data elements stored in the first and second pluralities of registers, allowing the index to select any data element from the first and second pluralities of data elements stored in the first and second pluralities of registers; wherein each index in the index vector has a value not restricted by the other indices in the index vector; and provide in parallel data elements from the first and second pluralities of data elements selected by the indices to respective predetermined positions in the destination register, wherein the predetermined positions are in the same order as the indices stored in the partitioned fields of the index register, wherein the predetermined positions are contiguous blocks of bits that take up an entire width of the destination register.
 54. The processor set forth in claim 53 wherein the destination register is a 128-bit register.
 55. A data processing system comprising: (a) a bus coupling components in the data processing system; (b) an external memory coupled to the bus; (c) a programmable microprocessor coupled to the bus and capable of operation independent of another host processor, the microprocessor comprising: an instruction path; a data path; an external interface operable to receive data from an external source and communicate the received data over the data path; a cache operable to retain data communicated between the external interface and the data path; a register file operable to receive and store data from the data path and communicate the stored data to the data path; and an execution unit coupled to the instruction path and the data path and operable to: decode a single instruction specifying a plurality of registers each having a register width, the plurality of registers independently specified by the single instruction and storing a plurality of data elements each having an elemental width smaller than the register width, an index register storing an index vector comprising a plurality of indices stored in partitioned fields of the index register and a destination register; wherein each index in the index vector comprises a sufficient number of bits to represent a range of possible index values, the range of possible index values including a different index value for each of the plurality of data elements stored in the plurality of registers, allowing the index to select any data element from the plurality of data elements stored in the plurality of registers; wherein each index in the index vector has a value not restricted by the other indices in the index vector; and provide in parallel the data elements selected by the indices to respective predetermined positions in the destination register, wherein the predetermined positions are in the same order as the indices stored in the partitioned fields of the index register.
 56. The system set forth in claim 55 wherein the plurality of registers comprises two registers.
 57. The system set forth in claim 55 wherein the plurality of registers comprises two 64-bit registers storing a combined total of sixteen 8-bit data elements.
 58. The system set forth in claim 55 wherein the number of indices stored in the index register is equal to the number of predetermined positions in the destination register.
 59. The system set forth in claim 55 wherein the index vector comprises n equal-sized indices and the destination register comprises n equal-sized predetermined positions.
 60. The system set forth in claim 59 wherein the index stored in a lowest order set of bits of the index register provides a data element to a lowest order set of bits of the destination register, the index in a second lowest order set of bits of the index register provide a data element to a second lowest order set of bits of the destination register and the index stored in a highest order set of bits of the index register provides a data element to a highest order set of bits of the destination register.
 61. The system set forth in claim 55 wherein the destination register is a 128-bit register.
 62. A data processing system comprising: (a) a bus coupling components in the data processing system; (b) an external memory coupled to the bus; (c) a programmable microprocessor coupled to the bus and capable of operation independent of another host processor, the microprocessor comprising: an instruction path; a data path; an external interface operable to receive data from an external source and communicate the received data over the data path; a cache operable to retain data communicated between the external interface and the data path; a register file operable to receive and store data from the data path and communicate the stored data to the data path; and an execution unit coupled to the instruction path and the data path and operable to: decode a single instruction specifying a first register storing a first plurality of data elements, a second register storing a second plurality of data elements, an index register storing an index vector comprising a plurality of indices stored in partitioned fields of the index register and a destination register; wherein the single instruction independently specifies the first register and the second register; wherein each of the first and second registers has a register width, and each of the first and second plurality of data elements has an elemental width smaller than the register width; wherein each index in the index vector comprises a sufficient number of bits to represent a range of possible index values, the range of possible index values including a different index value for each of the first and second pluralities of data elements stored in the first and second pluralities of registers, allowing the index to select any data element from the first and second pluralities of data elements stored in the first and second pluralities of registers; wherein each index in the index vector has a value not restricted by the other indices in the index vector; and provide in parallel data elements from the first and second pluralities of data elements selected by the indices to respective predetermined positions in the destination register, wherein the predetermined positions are in the same order as the indices stored in the partitioned fields of the index register, wherein the predetermined positions are contiguous blocks of bits that take up an entire width of the destination register.
 63. The system set forth in claim 62 wherein the destination register is a 128-bit register.
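
By way of illustration only, and without limiting the claims, the index-vector selection recited in claims 47 through 63 may be modeled in software as sketched below. The sketch takes the configuration of claim 57 (two 64-bit source registers holding sixteen 8-bit data elements) together with a 128-bit destination register. The names group_select and pack_bytes are hypothetical and do not appear in the specification. The sketch further assumes that each index field is as wide as a data element, that position 0 is the lowest order set of bits, and that an out-of-range index wraps modulo the element count; none of these is stated by the claims. A hardware execution unit would provide the selected elements in parallel, whereas this model loops over the positions sequentially.

```python
def group_select(sources, index_reg, reg_bits=64, elem_bits=8, dest_bits=128):
    """Software model of an index-vector element select (illustrative only).

    sources   -- list of source register values (Python ints)
    index_reg -- index register value; one index per elem_bits-wide field
                 (assumption: index fields are as wide as data elements)
    """
    elem_mask = (1 << elem_bits) - 1
    per_reg = reg_bits // elem_bits

    # Flatten every element of every source register into one table,
    # lowest order element of the first register first.
    elements = [(reg >> (elem_bits * i)) & elem_mask
                for reg in sources
                for i in range(per_reg)]

    positions = dest_bits // elem_bits
    dest = 0
    for pos in range(positions):
        # Each index is read independently of the others, so the same
        # element may be selected many times or not at all.
        index = (index_reg >> (elem_bits * pos)) & elem_mask
        index %= len(elements)  # assumption: out-of-range indices wrap
        # The element lands at the same relative position as its index.
        dest |= elements[index] << (elem_bits * pos)
    return dest


def pack_bytes(values):
    """Hypothetical helper: pack byte values into an int, lowest order first."""
    return int.from_bytes(bytes(values), "little")


if __name__ == "__main__":
    a = 0x1716151413121110  # elements 0..7
    b = 0x1F1E1D1C1B1A1918  # elements 8..15
    reversed_order = group_select([a, b], pack_bytes(range(15, -1, -1)))
    print(f"{reversed_order:032x}")
```

Because no index constrains any other, the same routine models an arbitrary rearrangement, a reversal, or a broadcast of a single element to every position of the destination register.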